## here() starts at /Users/emmajuansalazar/arxius/projects/ge-analysis-bronchial-tissue

1 Introduction

Lung cancer is the leading cause of cancer-related death in the United States. The molecular events preceding the onset of disease are poorly understood, and no effective tools exist to identify smokers with premalignant lesions (PMLs) that will progress to invasive cancer. Previous research has identified molecular changes in the smoke-exposed airways. However, the focus of the paper “Detecting the Presence and Progression of Premalignant Lung Lesions via Airway Gene Expression” Beane et al. (2017) shifts the focus to the first stages of the disease progress. They utilize samples from the same airway field of injury to study PMLs and their potential in preventing lung cancer.

The researchers’ goal is to profile samples extracted from normal appearing bronchial mucosa by mRNA-Seq, and attempt to find differentially expressed gene sets and pathways between subjects with PMLs and without PMLs. From the original 82 samples, 50 subjects had PML, 25 didn’t, and 7 were excluded in the end.

The resulting raw RNA-seq data has been deposited at the Gene Expression Omnibus (GEO), where they are publicly available under accession GSE79209.

In this analysis, we attempted to formulate a new question to answer, using the dataset of the mentioned paper.

We’ll perform an Exploratory Data Analysis of the corresponding RNA-seq gene expression profiles generated as a table of counts using the DEE2 (https://dee2.io) pipeline by Ziemann, Kaspi, and El-Osta (2019), and further packaged into a SummarizedExperiment object with genes mapped to Entrez identifiers. This object also stores the phenotypic information about the profiled samples that has been also made available at GEO.

The EDA will allow us to look further into the dataset and have enough information for a new Experimental Design. Hierarchical clustering and batch identification will assist us in choosing a new way to group the 82 samples, with the goal of finding differentially expressed genes. We will perform functional enrichment analysis to analyse high-throughput experimental results in order to discover biological annotations that are over-represented in the list of DE genes.

2 Quality assessment

2.1 Data import and cleaning

We start by importing the raw table of counts.

library(SummarizedExperiment)

se <- readRDS("data/GSE79209.rds")

We have 25122 genes by 81 samples. From the first row and column names shown by the object, we can figure out that genes are defined by Entrez (Maglott et al. 2010) identifiers and samples by Sequence Read Archive Run (SRR) identifiers.

The row data in this object contains information about the profiled genes.

head(rowData(se))
DataFrame with 6 rows and 5 columns
                  gene_id   gene_biotype            description
              <character>    <character>            <character>
1         ENSG00000121410 protein_coding alpha-1-B glycoprote..
10        ENSG00000156006 protein_coding N-acetyltransferase ..
100       ENSG00000196839 protein_coding adenosine deaminase ..
1000      ENSG00000170558 protein_coding cadherin 2 [Source:H..
10000     ENSG00000117020 protein_coding AKT serine/threonine..
100008586 ENSG00000236362 protein_coding G antigen 12F [Sourc..
             gene_id_version      symbol
                 <character> <character>
1         ENSG00000121410.11        A1BG
10         ENSG00000156006.4        NAT2
100       ENSG00000196839.12         ADA
1000       ENSG00000170558.8        CDH2
10000     ENSG00000117020.16        AKT3
100008586  ENSG00000236362.8     GAGE12F

Among this information, the gene symbol and description are potentially useful for interpreting results of, for instance, a differential expression analysis.

2.1.1 Exploratory Data Analysis

Let’s explore the information we have for the samples: phenotypical and technical data.

Conducting exploratory data analysis (EDA) will enable us to identify the number of phenotypical variables, the technical variables, and potential areas of interest for future experimental design.

dim(colData(se))
[1] 81 68
head(colData(se), n=3)
DataFrame with 3 rows and 68 columns
                            title geo_accession                status
                      <character>   <character>           <character>
SRR3225255  Subject L1, dysplasia    GSM2088002 Public on May 25 2017
SRR3225256 Subject L10, dysplasia    GSM2088003 Public on May 25 2017
SRR3225257 Subject L11, dysplasia    GSM2088004 Public on May 25 2017
           submission_date last_update_date        type channel_count
               <character>      <character> <character>   <character>
SRR3225255     Mar 14 2016      May 15 2019         SRA             1
SRR3225256     Mar 14 2016      May 15 2019         SRA             1
SRR3225257     Mar 14 2016      May 15 2019         SRA             1
              source_name_ch1 organism_ch1    characteristics_ch1
                  <character>  <character>            <character>
SRR3225255 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225256 Bronchial brushing Homo sapiens tissue: Bronchial br..
SRR3225257 Bronchial brushing Homo sapiens tissue: Bronchial br..
           characteristics_ch1.1 characteristics_ch1.2  characteristics_ch1.3
                     <character>           <character>            <character>
SRR3225255               age: 69             Sex: Male smoking status: Form..
SRR3225256               age: 50           Sex: Female smoking status: Curr..
SRR3225257               age: 49             Sex: Male smoking status: Curr..
           characteristics_ch1.4 characteristics_ch1.5  characteristics_ch1.6
                     <character>           <character>            <character>
SRR3225255        pack years: 58  copd status: No COPD max histology: Mild ..
SRR3225256        pack years: 34  copd status: No COPD max histology: Mild ..
SRR3225257        pack years: 35  copd status: No COPD max histology: Moder..
            characteristics_ch1.7  characteristics_ch1.8  characteristics_ch1.9
                      <character>            <character>            <character>
SRR3225255 dysplasia status: Dy.. total alignments: 91.. unique alignments: 8..
SRR3225256 dysplasia status: Dy.. total alignments: 96.. unique alignments: 8..
SRR3225257 dysplasia status: Dy.. total alignments: 12.. unique alignments: 1..
           characteristics_ch1.10 characteristics_ch1.11 characteristics_ch1.12
                      <character>            <character>            <character>
SRR3225255 read 1 alignment: 42.. read 2 alignment: 41.. positive strand alig..
SRR3225256 read 1 alignment: 45.. read 2 alignment: 43.. positive strand alig..
SRR3225257 read 1 alignment: 56.. read 2 alignment: 54.. positive strand alig..
           characteristics_ch1.13 characteristics_ch1.14 characteristics_ch1.15
                      <character>            <character>            <character>
SRR3225255 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225256 negative strand alig.. non-splice alignment.. splice alignments: 2..
SRR3225257 negative strand alig.. non-splice alignment.. splice alignments: 3..
           characteristics_ch1.16 characteristics_ch1.17 characteristics_ch1.18
                      <character>            <character>            <character>
SRR3225255 properly paired alig.. genebody 80/20 ratio.. mean gc content: 53.54
SRR3225256 properly paired alig.. genebody 80/20 ratio.. mean gc content: 45.45
SRR3225257 properly paired alig.. genebody 80/20 ratio.. mean gc content: 44.44
           molecule_ch1   extract_protocol_ch1 extract_protocol_ch1.1
            <character>            <character>            <character>
SRR3225255    polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225256    polyA RNA Bronchial airway bru.. Sequencing libraries..
SRR3225257    polyA RNA Bronchial airway bru.. Sequencing libraries..
             taxid_ch1 description        data_processing
           <character> <character>            <character>
SRR3225255        9606   Sample_L1 Demultiplexing and c..
SRR3225256        9606  Sample_L10 Demultiplexing and c..
SRR3225257        9606  Sample_L11 Demultiplexing and c..
                data_processing.1      data_processing.2      data_processing.3
                      <character>            <character>            <character>
SRR3225255 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225256 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
SRR3225257 The FASTQ files were.. Alignment and qualit.. Gene count estimates..
                data_processing.4  data_processing.5      data_processing.6
                      <character>        <character>            <character>
SRR3225255 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225256 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
SRR3225257 Gene filtering was c.. Genome_build: hg19 Supplementary_files_..
           platform_id data_row_count    instrument_model library_selection
           <character>    <character>         <character>       <character>
SRR3225255    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR3225256    GPL16791              0 Illumina HiSeq 2500              cDNA
SRR3225257    GPL16791              0 Illumina HiSeq 2500              cDNA
           library_source library_strategy               relation
              <character>      <character>            <character>
SRR3225255 transcriptomic          RNA-Seq SRA: https://www.ncb..
SRR3225256 transcriptomic          RNA-Seq SRA: https://www.ncb..
SRR3225257 transcriptomic          RNA-Seq SRA: https://www.ncb..
                       relation.1 supplementary_file_1     age:ch1
                      <character>          <character> <character>
SRR3225255 BioSample: https://w..                 NONE          69
SRR3225256 BioSample: https://w..                 NONE          50
SRR3225257 BioSample: https://w..                 NONE          49
           copd status:ch1 dysplasia status:ch1 genebody 80/20 ratio:ch1
               <character>          <character>              <character>
SRR3225255         No COPD            Dysplasia         1.34213035203572
SRR3225256         No COPD            Dysplasia         1.25508564705863
SRR3225257         No COPD            Dysplasia         1.20885950984169
            max histology:ch1 mean gc content:ch1
                  <character>         <character>
SRR3225255     Mild dysplasia               53.54
SRR3225256     Mild dysplasia               45.45
SRR3225257 Moderate dysplasia               44.44
           negative strand alignments:ch1 non-splice alignments:ch1
                              <character>               <character>
SRR3225255                       41653617                  62567474
SRR3225256                       44222609                  64646471
SRR3225257                       55046872                  79307301
           pack years:ch1 positive strand alignments:ch1
              <character>                    <character>
SRR3225255             58                       41837114
SRR3225256             34                       44366147
SRR3225257             35                       55244102
           properly paired alignments:ch1 read 1 alignment:ch1
                              <character>          <character>
SRR3225255                       66570222             42319429
SRR3225256                       70579486             45257546
SRR3225257                       87409558             56124363
           read 2 alignment:ch1     Sex:ch1 smoking status:ch1
                    <character> <character>        <character>
SRR3225255             41171302        Male      Former smoker
SRR3225256             43331210      Female     Current smoker
SRR3225257             54166611        Male     Current smoker
           splice alignments:ch1         tissue:ch1 total alignments:ch1
                     <character>        <character>          <character>
SRR3225255              20923257 Bronchial brushing             91000281
SRR3225256              23942285 Bronchial brushing             96810722
SRR3225257              30983673 Bronchial brushing            120861113
           unique alignments:ch1
                     <character>
SRR3225255              83490731
SRR3225256              88588756
SRR3225257             110290974

We have a total of 68 phenotypical variables.

To get a sense of what we have, we can look at the column names:

colnames(colData(se))
 [1] "title"                          "geo_accession"                 
 [3] "status"                         "submission_date"               
 [5] "last_update_date"               "type"                          
 [7] "channel_count"                  "source_name_ch1"               
 [9] "organism_ch1"                   "characteristics_ch1"           
[11] "characteristics_ch1.1"          "characteristics_ch1.2"         
[13] "characteristics_ch1.3"          "characteristics_ch1.4"         
[15] "characteristics_ch1.5"          "characteristics_ch1.6"         
[17] "characteristics_ch1.7"          "characteristics_ch1.8"         
[19] "characteristics_ch1.9"          "characteristics_ch1.10"        
[21] "characteristics_ch1.11"         "characteristics_ch1.12"        
[23] "characteristics_ch1.13"         "characteristics_ch1.14"        
[25] "characteristics_ch1.15"         "characteristics_ch1.16"        
[27] "characteristics_ch1.17"         "characteristics_ch1.18"        
[29] "molecule_ch1"                   "extract_protocol_ch1"          
[31] "extract_protocol_ch1.1"         "taxid_ch1"                     
[33] "description"                    "data_processing"               
[35] "data_processing.1"              "data_processing.2"             
[37] "data_processing.3"              "data_processing.4"             
[39] "data_processing.5"              "data_processing.6"             
[41] "platform_id"                    "data_row_count"                
[43] "instrument_model"               "library_selection"             
[45] "library_source"                 "library_strategy"              
[47] "relation"                       "relation.1"                    
[49] "supplementary_file_1"           "age:ch1"                       
[51] "copd status:ch1"                "dysplasia status:ch1"          
[53] "genebody 80/20 ratio:ch1"       "max histology:ch1"             
[55] "mean gc content:ch1"            "negative strand alignments:ch1"
[57] "non-splice alignments:ch1"      "pack years:ch1"                
[59] "positive strand alignments:ch1" "properly paired alignments:ch1"
[61] "read 1 alignment:ch1"           "read 2 alignment:ch1"          
[63] "Sex:ch1"                        "smoking status:ch1"            
[65] "splice alignments:ch1"          "tissue:ch1"                    
[67] "total alignments:ch1"           "unique alignments:ch1"         

The second column geo_accession contains GEO Sample Accession Number (GSM) identifers. GSM identifiers define individual samples, understood in our context as individual sources of RNA. We can check if we have any technical replicates as follows:

length(unique(se$geo_accession))
[1] 81
table(lengths(split(colnames(se), se$geo_accession)))

 1 
81 

So, we have 81 different individual samples which matches the number of samples that we have - 81. Therefore we don’t have any technical replicates and we can proceed with the next step.

To proceed further exploring the dataset, we are going to use the edgeR package and build a DGEList object, incorporating the gene metadata, which includes the gene symbol.

library(edgeR)

dge <- DGEList(counts=assays(se)$counts, genes=rowData(se))
head(dge$samples, 3)
           group lib.size norm.factors
SRR3225255     1 35504068            1
SRR3225256     1 39395250            1
SRR3225257     1 49591671            1
dim(dge)
[1] 25122    81

Calculate \(\log_2\) CPM units of expression and put them as an additional assay element to ease their manipulation. This provides a standarized and normalized representation of gene expression data.

assays(se)$logCPM <- cpm(dge, log=TRUE)
assays(se)$logCPM[1:5, 1:5]
      SRR3225255 SRR3225256 SRR3225257 SRR3225258 SRR3225259
1     -2.8144890 -2.9017808 -4.1164076 -2.4911549  -3.436999
10    -1.8212210 -2.6511767 -3.6837191 -2.1459416  -1.077656
100   -0.0186939  0.9340788  0.2471703 -0.6559511   1.072922
1000   1.4206565  2.2413119  3.0040881  0.5618081   2.296503
10000  1.3426936  2.2490363  2.1152217  2.2466499   1.900561

Let’s explore now some of the phenotypical variables.

To identify the columns that will be useful, we need to examine the number of unique values (levels) of each variable to determine their suitability for grouping samples. Variables with only one unique value or more than 80 unique values are not of interest. Therefore, we aim to create a list of columns with an appropriate number of unique values.

interesting_se <- data.frame(row.names=se$title)
technical_se <- data.frame(row.names=se$title)
for(column in colnames(colData(se))){
  se[[column]] <- factor(se[[column]])
  if(length(levels(se[[column]]))>1 && length(levels(se[[column]]))<81){
    interesting_se[[column]] <- se[[column]]
  } else{
    technical_se[[column]] <- se[[column]]
  }
}

Once we obtain the variables, we can now look at their levels to see what type of data they offer

head(interesting_se)
                       characteristics_ch1.1 characteristics_ch1.2
Subject L1, dysplasia                age: 69             Sex: Male
Subject L10, dysplasia               age: 50           Sex: Female
Subject L11, dysplasia               age: 49             Sex: Male
Subject L12, dysplasia               age: 57             Sex: Male
Subject L13, dysplasia               age: 64           Sex: Female
Subject L14, dysplasia               age: 69             Sex: Male
                                characteristics_ch1.3 characteristics_ch1.4
Subject L1, dysplasia   smoking status: Former smoker        pack years: 58
Subject L10, dysplasia smoking status: Current smoker        pack years: 34
Subject L11, dysplasia smoking status: Current smoker        pack years: 35
Subject L12, dysplasia smoking status: Current smoker        pack years: 40
Subject L13, dysplasia  smoking status: Former smoker     pack years: 37.58
Subject L14, dysplasia smoking status: Current smoker        pack years: 65
                       characteristics_ch1.5             characteristics_ch1.6
Subject L1, dysplasia   copd status: No COPD     max histology: Mild dysplasia
Subject L10, dysplasia  copd status: No COPD     max histology: Mild dysplasia
Subject L11, dysplasia  copd status: No COPD max histology: Moderate dysplasia
Subject L12, dysplasia  copd status: No COPD     max histology: Mild dysplasia
Subject L13, dysplasia  copd status: No COPD     max histology: Mild dysplasia
Subject L14, dysplasia     copd status: COPD max histology: Moderate dysplasia
                             characteristics_ch1.7 characteristics_ch1.18
Subject L1, dysplasia  dysplasia status: Dysplasia mean gc content: 53.54
Subject L10, dysplasia dysplasia status: Dysplasia mean gc content: 45.45
Subject L11, dysplasia dysplasia status: Dysplasia mean gc content: 44.44
Subject L12, dysplasia dysplasia status: Dysplasia mean gc content: 45.45
Subject L13, dysplasia dysplasia status: Dysplasia mean gc content: 47.47
Subject L14, dysplasia dysplasia status: Dysplasia mean gc content: 49.49
                       age:ch1 copd status:ch1 dysplasia status:ch1
Subject L1, dysplasia       69         No COPD            Dysplasia
Subject L10, dysplasia      50         No COPD            Dysplasia
Subject L11, dysplasia      49         No COPD            Dysplasia
Subject L12, dysplasia      57         No COPD            Dysplasia
Subject L13, dysplasia      64         No COPD            Dysplasia
Subject L14, dysplasia      69            COPD            Dysplasia
                        max histology:ch1 mean gc content:ch1 pack years:ch1
Subject L1, dysplasia      Mild dysplasia               53.54             58
Subject L10, dysplasia     Mild dysplasia               45.45             34
Subject L11, dysplasia Moderate dysplasia               44.44             35
Subject L12, dysplasia     Mild dysplasia               45.45             40
Subject L13, dysplasia     Mild dysplasia               47.47          37.58
Subject L14, dysplasia Moderate dysplasia               49.49             65
                       Sex:ch1 smoking status:ch1
Subject L1, dysplasia     Male      Former smoker
Subject L10, dysplasia  Female     Current smoker
Subject L11, dysplasia    Male     Current smoker
Subject L12, dysplasia    Male     Current smoker
Subject L13, dysplasia  Female      Former smoker
Subject L14, dysplasia    Male     Current smoker

Using this list, we identify 8 phenotypical (or environmental) variables of interest:

Age (characteristics_ch1.1)

Sex (characteristics_ch1.2)

Smoking status (characteristics_ch1.3): Current smoker or Former smoker

Pack years (characteristics_ch1.4): how many cigarettes a person has smoked in their lifetime

COPD Status (characteristics_ch1.1): Chronic obstructive pulmonary disease

Mean GC Content (characteristics_ch1.18)

Max Histology (characteristics_ch1.6): Histology types (tissue characteristics)

  • This variable determines what they’ll call dysplasia status (see next variable).
    • Dysplasia includes samples that have:

      • Mild Dysplasia

      • Moderate Dysplasia

      • Severe Dysplasia

    • No Dysplasia includes samples that have:

      • Hyperplasia

      • Normal histology

    • Samples that had metaplasia were excluded from the analysis

Dysplasia Status (characteristics_ch1.7)

  • Dysplasia

  • No Dysplasia

  • Metaplasia

To facilitate handling this variables we are going to perform the following modifications:

se$sex <- gsub("Sex: ", "", se$characteristics_ch1.2)
se$sex <- factor(se$sex, levels = c("Male", "Female"))

se$smoker <- gsub("smoking status: ", "", se$characteristics_ch1.3)
se$smoker <- factor(se$smoker, levels = c("Current smoker", "Former smoker"))

se$copd_status <- gsub("copd status: ", "", se$characteristics_ch1.5)
se$copd_status <- factor(se$copd_status, levels = c("COPD", "No COPD"))

se$histology <- gsub("max histology: ", "", se$characteristics_ch1.6)
se$histology <- factor(se$histology, levels(se$histology) <- c("Hyperplasia", "Metaplasia", "Mild dysplasia", "Moderate dysplasia", "Normal", "Severe dysplasia"))

se$dysplasia <- gsub("Subject L.*, ", "", se$title)
se$dysplasia <- factor(se$dysplasia)

To faciliate working with the pack years variable, we will split it into four categorical groups - Low, Medium, High and Very high.

To do that we will:
1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.

se$characteristics_ch1.4 <- gsub("pack years: ", "", se$characteristics_ch1.4)
class(se$characteristics_ch1.4)
[1] "character"
se$characteristics_ch1.4 <- as.numeric(se$characteristics_ch1.4)

quartiles <- quantile(se$characteristics_ch1.4, probs = c(0.25, 0.5, 0.75))

Now that we have the quartiles, we can proceed to assign categories to the values based on these quartiles.

Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”


se$pack_years <- cut(se$characteristics_ch1.4,
                          breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
                          labels = c("Low", "Medium", "High", "Very High"),
                          include.lowest = TRUE)

We will do the same with the age variable -split it into four intervals.

To do that we will:
1. Calculate quartiles for the data.
2. Assign categories based on quartile ranges.
3. Convert the values into their corresponding categories.


se$age <- gsub("age: ", "", se$characteristics_ch1.1)
class(se$age)
[1] "character"
se$age <- as.numeric(se$age)

quartiles <- quantile(se$age, probs = c(0.25, 0.5, 0.75))
quartiles
25% 50% 75% 
 58  64  69 

Values below the first quartile (Q1) will be categorized as “low.” Values between Q1 and the median (Q2) will be categorized as “medium.” Values between the median (Q2) and the third quartile (Q3) will be categorized as “high.” Values above the third quartile (Q3) will be categorized as “very high.”

se$age_intervals <- cut(se$age,
                          breaks = c(-Inf, quartiles[1], quartiles[2], quartiles[3], Inf),
                          labels = c("age <= 58", "58 < age <= 64", "64 < age <= 69", "69 < age"),
                          include.lowest = TRUE)

In Table ?? below, we show the five variables of interest (smoking status, pack years, histology, dysplasia and COPD status) jointly to gather as much understanding as possible on the underlying experimental design.

2.2 Sequencing depth

In the following step we will evaluate the sequencing depth in terms of the total number of read counts that are mapped to each sample’s genome. The sequencing depth per sample, commonly referred to as library sizes, is displayed in Figure 1 below in increasing order.

Library sizes in increasing order.

Figure 1: Library sizes in increasing order

The plot displays a range of sequencing depths for the samples, each represented by an SRR identifier. The sequencing depth across our samples ranges from 20 to 50 million reads, with the majority falling between 30 to 40 million reads. This indicates a generally high level of coverage, which is beneficial for accurate genomic analysis. The plot also suggests that the highest number of reads corresponds to mild and moderate dysplasia, indicating early stages of cellular abnormalities that could potentially progress to cancer. The predominance of mild dysplasia in these samples could be of particular interest for early detection and intervention studies. There does not appear to be any evident bias in terms of sequencing depth across the conditions represented.

2.3 Distribution of expression levels among samples

Figure 2 below shows the distribution of expression values per sample in logarithmic CPM units of expression.

Non-parametric density distribution of expression profiles per sample.

Figure 2: Non-parametric density distribution of expression profiles per sample

There are no substantial differences between the samples in the distribution of expression values.

2.4 Distribution of expression levels among genes

Let’s calculate now the average expression per gene through all the samples. Figure 3 shows the distribution of those values across genes.

Distribution of average expression level per gene.

Figure 3: Distribution of average expression level per gene

The histogram displays a bimodal distribution, which means there are two distinct peaks. The large peak to the left is the most prominent feature of the histogram, showing that most genes in this dataset are expressed at low levels or possibly not at all, under the conditions of the experiment. On the right part, the peak which indicates that as gene expression levels increase, fewer genes are found at these higher expression levels.

2.5 Filtering of lowly-expressed genes

To filter out lowly-expressed genes, we first calculate the row means of the expression data and apply a logical mask to remove genes with an average logCPM ≤ 1. This step refines the dataset and removes noise, ensuring subsequent analysis is based on high-quality data.

Next, we calculate a CPM cutoff based on the smallest library size. We check for any missing values in the histology data to ensure data integrity.

We then focus on filtering genes based on their presence in samples with premalignant lesions (PML), specifically those categorized as “Mild dysplasia”, “Moderate dysplasia”, and “Severe dysplasia”. We count the number of PML samples and calculate the CPM for these samples only.

Finally, we apply a logical mask to retain genes that meet the CPM cutoff in at least the number of PML samples. This two-step filtering process is crucial for refining the dataset and focusing the analysis on genes with relevant expression patterns. Figure 4 shows the distribution of expression values across genes before and after filtering.

mask_initial <- rowMeans(assays(se)$logCPM) > 1
se_initial_filt <- se[mask_initial, ]
dge_initial_filt <- dge[mask_initial, ]
dim(se_initial_filt)
[1] 13374    81

cpmcutoff <- round(10/min(dge$samples$lib.size/1e6), digits=1)
cpmcutoff
[1] 0.4

if(any(is.na(se$histology))) {
  stop("Missing values found in the histology column.")
}

pml_status <- se$histology %in% c("Mild dysplasia", "Moderate dysplasia", "Severe dysplasia")
nsamplescutoff <- sum(pml_status)
nsamplescutoff
[1] 49

pml_indices <- which(pml_status)

cpm_pml <- cpm(dge_initial_filt)[, pml_indices]

mask_final <- rowSums(cpm_pml > cpmcutoff) >= nsamplescutoff
sum(mask_final)
[1] 13167
se.filt <- se_initial_filt[mask_final, ]
dge.filt <- dge_initial_filt[mask_final, ]
dim(se.filt)
[1] 13167    81
Distribution of lowly-expressed genes.

Figure 4: Distribution of lowly-expressed genes

We are left with 13167 genes.

2.6 Normalization

We calculate the normalization factors on the filtered expression data set.

dge.filt <- calcNormFactors(dge.filt)

Replace the raw log2 CPM units in the corresponding assay element of the SummarizedExperiment object, by the normalized ones.

assays(se.filt)$logCPM <- cpm(dge.filt, log=TRUE,
                              normalized.lib.sizes=TRUE)

2.7 MA-plots

We examine now the MA-plots of the normalized expression profiles in the figures below.

MA-plots of filtered and normalized expression values.

Figure 5: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 6: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 7: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 8: MA-plots of filtered and normalized expression values

MA-plots of filtered and normalized expression values.

Figure 9: MA-plots of filtered and normalized expression values

For genes above the horizontal line at M=0, expression is higher in the condition being tested against the reference, as for genes below, expression is lower. The red line represents a loose fit to the data, which should ideally be around M=0 if there is no differential expression between conditions. Any systematic deviation from M=0 indicates potential bias or systematic variation in the data. In most of the plots, the red line remains close to M=0 across the range of A values, indicating no systematic bias in expression changes relative to gene abundance. Some plots show a slight dip in the loose line at the lower end of the A scale, suggesting a potential bias for lowly expressed genes which could be potential candidates for differential expression. These points would be of interest for follow-up analyses.

2.8 Experimental design and batch identification

2.8.1 Technical variables

To identify potential columns which could constitute unwanted variability, we looked at columns that contain information related to technical aspects, experimental conditions or processing details that could vary between different batches.

When doing EDA, we created a dataframe with all the phenotypical variables, and another one with the technical ones.

head(technical_se, n=3)
                                        title geo_accession
Subject L1, dysplasia   Subject L1, dysplasia    GSM2088002
Subject L10, dysplasia Subject L10, dysplasia    GSM2088003
Subject L11, dysplasia Subject L11, dysplasia    GSM2088004
                                      status submission_date last_update_date
Subject L1, dysplasia  Public on May 25 2017     Mar 14 2016      May 15 2019
Subject L10, dysplasia Public on May 25 2017     Mar 14 2016      May 15 2019
Subject L11, dysplasia Public on May 25 2017     Mar 14 2016      May 15 2019
                       type channel_count    source_name_ch1 organism_ch1
Subject L1, dysplasia   SRA             1 Bronchial brushing Homo sapiens
Subject L10, dysplasia  SRA             1 Bronchial brushing Homo sapiens
Subject L11, dysplasia  SRA             1 Bronchial brushing Homo sapiens
                              characteristics_ch1       characteristics_ch1.8
Subject L1, dysplasia  tissue: Bronchial brushing  total alignments: 91000281
Subject L10, dysplasia tissue: Bronchial brushing  total alignments: 96810722
Subject L11, dysplasia tissue: Bronchial brushing total alignments: 120861113
                              characteristics_ch1.9     characteristics_ch1.10
Subject L1, dysplasia   unique alignments: 83490731 read 1 alignment: 42319429
Subject L10, dysplasia  unique alignments: 88588756 read 1 alignment: 45257546
Subject L11, dysplasia unique alignments: 110290974 read 1 alignment: 56124363
                           characteristics_ch1.11
Subject L1, dysplasia  read 2 alignment: 41171302
Subject L10, dysplasia read 2 alignment: 43331210
Subject L11, dysplasia read 2 alignment: 54166611
                                     characteristics_ch1.12
Subject L1, dysplasia  positive strand alignments: 41837114
Subject L10, dysplasia positive strand alignments: 44366147
Subject L11, dysplasia positive strand alignments: 55244102
                                     characteristics_ch1.13
Subject L1, dysplasia  negative strand alignments: 41653617
Subject L10, dysplasia negative strand alignments: 44222609
Subject L11, dysplasia negative strand alignments: 55046872
                                characteristics_ch1.14
Subject L1, dysplasia  non-splice alignments: 62567474
Subject L10, dysplasia non-splice alignments: 64646471
Subject L11, dysplasia non-splice alignments: 79307301
                            characteristics_ch1.15
Subject L1, dysplasia  splice alignments: 20923257
Subject L10, dysplasia splice alignments: 23942285
Subject L11, dysplasia splice alignments: 30983673
                                     characteristics_ch1.16
Subject L1, dysplasia  properly paired alignments: 66570222
Subject L10, dysplasia properly paired alignments: 70579486
Subject L11, dysplasia properly paired alignments: 87409558
                                       characteristics_ch1.17 molecule_ch1
Subject L1, dysplasia  genebody 80/20 ratio: 1.34213035203572    polyA RNA
Subject L10, dysplasia genebody 80/20 ratio: 1.25508564705863    polyA RNA
Subject L11, dysplasia genebody 80/20 ratio: 1.20885950984169    polyA RNA
                                                                                                                                                                                                                                                                                                                                                                                                         extract_protocol_ch1
Subject L1, dysplasia  Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
Subject L10, dysplasia Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
Subject L11, dysplasia Bronchial airway brushings were obtained during autofluorescence bronchoscopy. Autofluorescence bronchoscopy and endobronchial biopsy were used to identify and sample PMLs (if present). The pathological grade of the lesions with the worst histology present was recorded. Total RNA was extracted from bronchial brushings using miRNeasy Mini kit (Qiagen) according to manufacturer's protocol.
                                                                                                                                                                                                                                                                                                                                                                                      extract_protocol_ch1.1
Subject L1, dysplasia  Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
Subject L10, dysplasia Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
Subject L11, dysplasia Sequencing libraries were prepared from total RNA samples using Illumina® TruSeq® RNA Sample Preparation Kit v2. The libraries from individual samples were pooled in groups of four for cluster generation on the Illumina® cBot® using Illumina® TruSeq® Paired-End Cluster Kit. Each sample was sequenced on the Illumina® HiSeq® 2500 to generate paired-end 100 nucleotide reads
                       taxid_ch1 description
Subject L1, dysplasia       9606   Sample_L1
Subject L10, dysplasia      9606  Sample_L10
Subject L11, dysplasia      9606  Sample_L11
                                                                                                       data_processing
Subject L1, dysplasia  Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
Subject L10, dysplasia Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
Subject L11, dysplasia Demultiplexing and creation of FASTQ files was performed using Illumina CASAVA v1.8.2 software.
                                                                                                                                                                                                                     data_processing.1
Subject L1, dysplasia  The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
Subject L10, dysplasia The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
Subject L11, dysplasia The FASTQ files were aligned to hg19 using TopHat v2.0.4 software and default parameters in addition to the mean insert size and standard deviation determined using MISO based on an initial TopHat alignment.
                                                                                                  data_processing.2
Subject L1, dysplasia  Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
Subject L10, dysplasia Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
Subject L11, dysplasia Alignment and quality metrics, including genebody ratio, were calculated using RSeQC v2.3.3.
                                                                                                                            data_processing.3
Subject L1, dysplasia  Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
Subject L10, dysplasia Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
Subject L11, dysplasia Gene count estimates were derived using HTSeq-count v0.5.4 with Python v2.7.3 and the Ensembl v64 GTF annotation file.
                                                                                                                                                                                                        data_processing.4
Subject L1, dysplasia  Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
Subject L10, dysplasia Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
Subject L11, dysplasia Gene filtering was conducted on normalized counts per million (cpm) calculated using R v3.0.0 and edgeR v3.4.2 using a modified version of the mixture model in the SCAN.UPC Bioconductor package.
                        data_processing.5
Subject L1, dysplasia  Genome_build: hg19
Subject L10, dysplasia Genome_build: hg19
Subject L11, dysplasia Genome_build: hg19
                                                                                                                                                                                                                                                                                                                                                             data_processing.6
Subject L1, dysplasia  Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
Subject L10, dysplasia Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
Subject L11, dysplasia Supplementary_files_format_and_content: tab-delimited matrices includes raw count values for each Sample and each filtered Gene, CPM values for each Sample and each filtered Gene, RPKM values for each Sample and each filtered Gene, raw count values for each Sample and each filtered Gene, and RPKM values for each Sample and each filtered Gene
                       platform_id data_row_count    instrument_model
Subject L1, dysplasia     GPL16791              0 Illumina HiSeq 2500
Subject L10, dysplasia    GPL16791              0 Illumina HiSeq 2500
Subject L11, dysplasia    GPL16791              0 Illumina HiSeq 2500
                       library_selection library_source library_strategy
Subject L1, dysplasia               cDNA transcriptomic          RNA-Seq
Subject L10, dysplasia              cDNA transcriptomic          RNA-Seq
Subject L11, dysplasia              cDNA transcriptomic          RNA-Seq
                                                                    relation
Subject L1, dysplasia  SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631440
Subject L10, dysplasia SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631441
Subject L11, dysplasia SRA: https://www.ncbi.nlm.nih.gov/sra?term=SRX1631442
                                                                           relation.1
Subject L1, dysplasia  BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04557034
Subject L10, dysplasia BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04557035
Subject L11, dysplasia BioSample: https://www.ncbi.nlm.nih.gov/biosample/SAMN04556976
                       supplementary_file_1 genebody 80/20 ratio:ch1
Subject L1, dysplasia                  NONE         1.34213035203572
Subject L10, dysplasia                 NONE         1.25508564705863
Subject L11, dysplasia                 NONE         1.20885950984169
                       negative strand alignments:ch1 non-splice alignments:ch1
Subject L1, dysplasia                        41653617                  62567474
Subject L10, dysplasia                       44222609                  64646471
Subject L11, dysplasia                       55046872                  79307301
                       positive strand alignments:ch1
Subject L1, dysplasia                        41837114
Subject L10, dysplasia                       44366147
Subject L11, dysplasia                       55244102
                       properly paired alignments:ch1 read 1 alignment:ch1
Subject L1, dysplasia                        66570222             42319429
Subject L10, dysplasia                       70579486             45257546
Subject L11, dysplasia                       87409558             56124363
                       read 2 alignment:ch1 splice alignments:ch1
Subject L1, dysplasia              41171302              20923257
Subject L10, dysplasia             43331210              23942285
Subject L11, dysplasia             54166611              30983673
                               tissue:ch1 total alignments:ch1
Subject L1, dysplasia  Bronchial brushing             91000281
Subject L10, dysplasia Bronchial brushing             96810722
Subject L11, dysplasia Bronchial brushing            120861113
                       unique alignments:ch1
Subject L1, dysplasia               83490731
Subject L10, dysplasia              88588756
Subject L11, dysplasia             110290974

Here are some columns which we thought could introduce unwanted variability: submission_date, last_update_date, type, channel_count, extract_protocol_ch1, extract_protocol_ch1.1, data_processing, data_processing.1, data_processing.2, data_processing.3, data_processing.4, data_processing.5, data_processing.6, platform_id, instrument_model, library_selection, library_source, library_strategy.

These variables have the same values across all samples which suggests that they are not contributing to the variability of the dataset. For example:

table(se.filt$histology, se.filt$library_strategy)
                    
                     RNA-Seq
  Hyperplasia             13
  Metaplasia               7
  Mild dysplasia          35
  Moderate dysplasia      11
  Normal                  12
  Severe dysplasia         3

There is perfect correlation between library strategy and histology. This means that differences between our samples will be due to biological differences rather than technical.

We can discard all of the variables that have the same value across all samples:

deleted <- c()
for(column in colnames(technical_se)){
  if(length(levels(technical_se[[column]]))==1){
    technical_se[[column]] <- NULL
    deleted <- c(deleted, c(column))
  }
}
head(deleted)
[1] "status"           "submission_date"  "last_update_date" "type"            
[5] "channel_count"    "source_name_ch1" 

The rest of the variables are related to technical aspects related to the sequencing that’s specific to every sample, therefore we can discard them as sources of unwanted variability.

2.8.2 Phenotypical Variables

We have to now take a look at how our phenotypical variables interact and whether they might explain any sample clustering we may find, to help us make a decision on what groups to choose for our DE analysis.

First, let’s examine the distribution between dysplasia and histology.

table(se.filt$dysplasia, se.filt$histology)
              
               Hyperplasia Metaplasia Mild dysplasia Moderate dysplasia Normal
  dysplasia              0          0             35                 11      0
  metaplasia             0          7              0                  0      0
  no dysplasia          13          0              0                  0     12
              
               Severe dysplasia
  dysplasia                   3
  metaplasia                  0
  no dysplasia                0

This is a very obvious result, since we know dysplasia was drawn from the histology variable. This is a great example of combinations that aren’t sequenced (in this case, because they can’t exist.)

We can also examine the distribution between histology and our other prototypical variables

table(se.filt$histology, se.filt$age_intervals)
                    
                     age <= 58 58 < age <= 64 64 < age <= 69 69 < age
  Hyperplasia                3              4              3        3
  Metaplasia                 1              4              1        1
  Mild dysplasia            15              5              8        7
  Moderate dysplasia         3              2              3        3
  Normal                     2              3              3        4
  Severe dysplasia           0              1              1        1
table(se.filt$histology, se.filt$sex)
                    
                     Male Female
  Hyperplasia           9      4
  Metaplasia            3      4
  Mild dysplasia       22     13
  Moderate dysplasia    9      2
  Normal                7      5
  Severe dysplasia      3      0
table(se.filt$histology, se.filt$pack_years)
                    
                     Low Medium High Very High
  Hyperplasia          5      3    1         4
  Metaplasia           2      1    2         2
  Mild dysplasia      12      7    9         7
  Moderate dysplasia   1      5    2         3
  Normal               1      4    4         3
  Severe dysplasia     0      0    2         1
table(se.filt$histology, se.filt$copd_status)
                    
                     COPD No COPD
  Hyperplasia           4       9
  Metaplasia            2       5
  Mild dysplasia       10      25
  Moderate dysplasia    5       6
  Normal                1      11
  Severe dysplasia      2       1

We can see that not all combinations of pack-years and histology have been sequenced. Some combinations have missing values, indicated by 0s.

It’s interesting to note that the 0s indicate:

  • Groups like low or medium pack years (people that have smoked less cigarettes over the course of their lives) have no samples with severe dysplasia, meaning severe detriment to their bronchial tissue.

  • There’s also no females with severe dysplasia.

  • And the severe dysplasia group also lacks samples on the younger age interval.

The distribution we’re interested in, however, is the one between these phenotypical variables and the dysplasia variable (same concept as histology, but grouped in three groups)

table(se.filt$dysplasia, se.filt$age_intervals)
              
               age <= 58 58 < age <= 64 64 < age <= 69 69 < age
  dysplasia           18              8             12       11
  metaplasia           1              4              1        1
  no dysplasia         5              7              6        7
table(se.filt$dysplasia, se.filt$sex)
              
               Male Female
  dysplasia      34     15
  metaplasia      3      4
  no dysplasia   16      9
table(se.filt$dysplasia, se.filt$pack_years)
              
               Low Medium High Very High
  dysplasia     13     12   13        11
  metaplasia     2      1    2         2
  no dysplasia   6      7    5         7
table(se.filt$dysplasia, se.filt$copd_status)
              
               COPD No COPD
  dysplasia      17      32
  metaplasia      2       5
  no dysplasia    5      20

We can see that we have sequences for all of these different combinations.

2.8.3 Hierarchical Clustering

We examine now how samples group together by hierarchical clustering and multidimensional scaling, annotating pack-years and histology. We calculate again log CPM values with a high prior count (3) to moderate extreme fold-changes produced by low counts. The resulting dendrogram is shown in Figure 10.

Figure S6: Hierarchical clustering of the samples. Labels correspond to pack-years and sample identifer, while colors indicate histology groups.

Figure 10: Figure S6: Hierarchical clustering of the samples
Labels correspond to pack-years and sample identifer, while colors indicate histology groups.

Samples are linked together based on their similarity, with the branches connecting those most similar first. As we move up the tree, the connections represent progressively less similar groups until all are joined in a single cluster. Samples do not really cluster together by histology type, which means histology type is not the primary factor driving the similarity or dissimilarity between samples.

If we look more closely at the labels, we can also see even if we look at clustering through the 3 dysplasia groups instead of all 6 histology groups, there is still no clear clustering.

We can now try to identify a phenotypical variable that will justify clustering.

Figure S6: Hierarchical clustering of the samples. Labels correspond to dysplasia status, while colors indicate age groups.

Figure 11: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate age groups.

Figure S6: Hierarchical clustering of the samples. Labels correspond to dysplasia status, while colors indicate sex groups.

Figure 12: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate sex groups.

Figure S6: Hierarchical clustering of the samples. Labels correspond to dysplasia status, while colors indicate COPD groups.

Figure 13: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate COPD groups.

Figure S6: Hierarchical clustering of the samples. Labels correspond to dysplasia status, while colors indicate smoking status groups.

Figure 14: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate smoking status groups.

Figure S6: Hierarchical clustering of the samples. Labels correspond to dysplasia status, while colors indicate pack-years groups.

Figure 15: Figure S6: Hierarchical clustering of the samples
Labels correspond to dysplasia status, while colors indicate pack-years groups.

No clustering is evident in any of the resulting plots except for Smoking Status, where we can see some clustering between the two categories: Current smoker vs. Former smoker.

In Figure 16 we show the corresponding MDS plot for the Smoking Status variable.

Figure S7: Multidimensional scaling plot of the samples. Labels correspond to treatment and colors indicate sample group.

Figure 16: Figure S7: Multidimensional scaling plot of the samples
Labels correspond to treatment and colors indicate sample group.

We can distinguish more clearly that there are indeed two clusters, one for Current smokers and another one for Former smokers. This leads us to assume these two groups have significant differences in their gene expression.

2.9 Formulating the question to answer

Our Exploratory Data Analysis and new Experimental Design allowed us to identify a variable that raises interesting questions when faced with the prospect of investigating it: Smoking status.

All of the samples in our dataset have either been or currently are smokers. Based on the results we’ve obtained on this first part of our project, we’ve decided to use this variable to design our new groups. We obtain: Current Smokers (n = r length(se.filt\(smoker[se.filt\)smoker == “Current smoker”])) and Former smokers (n = r length(se.filt\(smoker[se.filt\)smoker == “Former smoker”])).

We’re also interested in looking at another variable, seeing if it has statistically significant results, to be able to compare its findings to the Smoking status analysis. Based on our clustering, we have identified Dysplasia as a variable that clusters. Dysplasia, which is based on Histology level, assessing the damage in the tissue. It is relevan to see how Current vs. Former smokers differ and also which processes differentially involved in one of these groups is also related to airway tissue damage.

Our goal is to find differentially expressed gene sets and pathways between these groups. To do so, we’ll perform Differential Expression Analysis with Linear Regression models, using Bioconductor’s limma pipeline Smyth (2004) Matthew E. Ritchie (2015). The results will be used to perform Functional Analysis using the Bioconductor package GOstats (Gene Ontology), and finally we’ll perform Gene Set Enrichment Analysis to obtain our final results.

These results will help us gain an insight into How the genome changes once a person stops smoking, and wether or not pathways invoved in this are also involved in damage to airway tissue.

From hereon, every step will be done once for the Dysplasia grouping (samples with PML vs without) and then for the Smoking status grouping, to obtain our new results.

3 Differential expression

The next step is to do a Differential Expression Analysis. At first we’ll simply use fold-change values to get a first glance at differentially expressed genes, and we’ll follow that by DE through linear regression. We chose to do two separate DE Analysis, based on the following variables:

PML (Premalignant Lesions)

Smoking status

3.1 Fold Change to Analyze Differential Expression

Firstly, a simple way to find differentially expressed genes is to simply compare expression levels between the two groups of samples.

We can do this using fold-change, which compares the mean read counts (using the logarithmic scale to try and stabilize variability across expression levels).

We’ll see how many genes change their expression between the two groups using two different classifications.

3.1.1 Smoking status

Our two groups of samples are: samples of Current Smokers and samples of Former Smokers. We calculate the means of the values of the expression levels.

current <- rowMeans(logCPM[, se.filt$smoker=="Current smoker"])
former <- rowMeans(logCPM[, se.filt$smoker=="Former smoker"])

We can now calculate fold-change values for each gene in these two groups. We’ll rank the genes by decreasing absolute value of Fold Change, and we’ll call differentially expressed genes those that are at the top 10.

log2fc_smoker <- current - former
ranking <- order(abs(log2fc_smoker), decreasing=TRUE)
fold_change_smoker <- data.frame(EntrezID=names(log2fc_smoker[ranking]),
                                    Log2FC=round(log2fc_smoker[ranking], digits=3),
                                    FC=round(2^log2fc_smoker[ranking], digits=3),
                                    "1/FC"=round(2^(-log2fc_smoker[ranking]), digits=3),
                                    row.names=rowData(se.filt)$symbol[ranking],
                                    check.names=FALSE)
de_fc_smoker <- head(fold_change_smoker, n=10)
de_fc_smoker
        EntrezID Log2FC     FC  1/FC
AKR1B10    57016  4.732 26.584 0.038
UCHL1       7345  4.439 21.688 0.046
CYP1B1      1545  4.030 16.332 0.061
BPIFB2     80341  3.842 14.343 0.070
SPP1        6696  3.496 11.286 0.089
SLC7A11    23657  3.362 10.281 0.097
ALDH3A1      218  3.285  9.749 0.103
CABYR      26256  2.991  7.950 0.126
AKR1C2      1646  2.942  7.686 0.130
H19       283120  2.882  7.372 0.136

In the resulting Data Frame:

  • If a gene has a positive Log2FC value, it means said gene is FC times more upregulated in samples that are Current smokers compared to in samples that are Former smokers.

    • For example, AKR1B10 is 26 times more upregulated in samples that are Current smokers than in samples that are Former smokers
  • If a gene has a negative Log2FC value, it means said gene is 1/FC times more upregulated in samples that are Former smokers compared to in samples that are Current smokers.

We can clearly see the difference of expression between groups is larger with the smoking status distinction than the dysplasia one.

3.1.2 PML (Dysplasia)

The two main groups we want to contrast are samples with dysplasia and samples without it. That is why our outcome target is Dysplasia.

no_dysplasia <- rowMeans(logCPM[, se.filt$dysplasia=="no dysplasia"])
dysplasia <- rowMeans(logCPM[, se.filt$dysplasia=="dysplasia"])

We can now calculate fold-change values for each gene in these two groups.

log2fc_dysplasia <- dysplasia - no_dysplasia
ranking <- order(abs(log2fc_dysplasia), decreasing=TRUE)
fold_change_dysplasia <- data.frame(EntrezID=names(log2fc_dysplasia[ranking]),
                                    Log2FC=round(log2fc_dysplasia[ranking], digits=3),
                                    FC=round(2^log2fc_dysplasia[ranking], digits=3),
                                    "1/FC"=round(2^(-log2fc_dysplasia[ranking]), digits=3),
                                    row.names=rowData(se.filt)$symbol[ranking],
                                    check.names=FALSE)
de_fc_dysplasia <- head(fold_change_dysplasia, n=10)
de_fc_dysplasia
            EntrezID Log2FC    FC  1/FC
CLDN18         51208 -1.040 0.486 2.056
MROH7         374977 -1.031 0.490 2.043
AGAP12P       414224 -0.975 0.509 1.965
FAM95C     100289137 -0.910 0.532 1.880
FAM238C       387644 -0.902 0.535 1.869
AC090826.1    729739 -0.896 0.537 1.861
LRP2            4036 -0.891 0.539 1.855
CAPN3            825 -0.885 0.541 1.847
AOC3            8639 -0.875 0.545 1.834
NPIPB3         23117 -0.848 0.556 1.800

In the resulting Data Frame:

  • If a gene has a positive Log2FC value, it means said gene is FC times more upregulated in samples with Dysplasia compared to in samples without Dysplasia.

  • If a gene has a negative Log2FC value, it means said gene is 1/FC times more upregulated in patients without Dysplasia compared to in patients with Dysplasia

    • For example, CLDN18 is more upregulated in samples with Dysplasia than in samples without Dysplasia

3.2 Linear Regression to analyze Differential Expression

We will now use Linear Regression Models to perform our Differential Expression, using the limma pipeline.

library(limma)

We’ll use the same two groupings, and use the followings steps:

  1. Build our design matrix
  2. Fit the linear model
  3. Calculate moderated t-statistics
  4. And finally we’ll output the results, and get our list of differentially expressed Genes.

3.2.1 Smoking status

We will start with smoking status since the fold-change values are higher and it probably means we’ll need less adjusting during the process (and it’ll be simpler).

First, we build our design matrix.

mod <- model.matrix(~ smoker, data=colData(se.filt))
head(mod)
           (Intercept) smokerFormer smoker
SRR3225255           1                   1
SRR3225256           1                   0
SRR3225257           1                   0
SRR3225258           1                   0
SRR3225259           1                   1
SRR3225260           1                   0

Next, we fit the linear regression model

fit <- lmFit(assays(se.filt)$logCPM, mod)

We calculate moderated t-statistics

fit <- eBayes(fit)

A quick overview of the results for FDR p-value cutoff at 5% (default)

res <- decideTests(fit)
summary(res)
       (Intercept) smokerFormer smoker
Down             0                3394
NotSig           0                6125
Up           13167                3648

summary(res)[1,2] + summary(res)[3,2]
[1] 7042

We have 3394 downregulated genes, and 3648 upregulated genes, a total of 7042.

Seeing these results, we can afford to be a little more restricting with the FDR cut-off. We’ll fetch the results with a p-value cut-off of 0,001%.

res <- decideTests(fit, p.value=0.00001)
summary(res)
       (Intercept) smokerFormer smoker
Down             0                 805
NotSig           7               11806
Up           13160                 556

summary(res)[1,2] + summary(res)[3,2]
[1] 1361

We now have 805 downregulated genes, and 556 upregulated genes, a total of 1361.

And a more detailed view of the results (we’ll be sure to add more information about the genes, so as to not only have the EntrezID identifier as rowname).

The entire table, with information about all the genes, sorted by p-value:

genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt))),
                      symbol=rowData(se.filt)[,5],
                      stringsAsFactors=FALSE)
gene_id=rowData(se.filt)[,1]
fit$genes <- genesmd
  
tt_smoker <- topTable(fit, coef=2, n=Inf)
tt_smoker <- tt_smoker[order(tt_smoker$adj.P.Val),]
dim(tt_smoker)
[1] 13167     8
head(tt_smoker)
      chr  symbol     logFC  AveExpr         t      P.Value    adj.P.Val
218    17 ALDH3A1 -3.285348 9.983528 -11.59737 6.093143e-19 8.022842e-15
10553  11 HTATIP2 -1.015938 5.636573 -11.33322 1.960761e-18 1.037547e-14
4199    6     ME1 -2.063224 4.596722 -11.29109 2.363971e-18 1.037547e-14
26256  18   CABYR -3.002625 3.221512 -11.14008 4.627564e-18 1.315031e-14
1728   16    NQO1 -1.812183 9.108047 -11.12299 4.993663e-18 1.315031e-14
7345    4   UCHL1 -4.451732 4.192392 -10.92192 1.225474e-17 2.095782e-14
             B
218   32.59231
10553 31.45628
4199  31.27444
26256 30.62121
1728  30.54715
7345  29.67372

But our goal is to get a list of DE genes, so, finally, we’ll fetch the genes called DE with FDR <0,1%

DEgenes_smoker <- rownames(tt_smoker)[tt_smoker$adj.P.Val < 0.00001]
length(DEgenes_smoker)
[1] 1361

We’ll now plot the distribution of p-values and moderated t-statistics (Figure 17)

par(mfrow=c(1,2), mar=c(4, 5, 2, 2))
hist(tt_smoker$P.Value, xlab="Raw P-values", main="", las=1)
qqt(fit$t[, 2], df=fit$df.prior+fit$df.residual, main="", pch=".", cex=3)
abline(0, 1, lwd=2)
Distribution of p-values and moderated t-statistics on every gene between samples with that are Current smokers vs. samples that are former smokers.

Figure 17: Distribution of p-values and moderated t-statistics on every gene between samples with that are Current smokers vs. samples that are former smokers

Another relevant one is the volcano plot (Figure ??).

volcanoplot(fit, coef=2, highlight=7, names=fit$genes$symbol, las=1)
P-value diagnostic plot for differental expression between samples that are Current smokers and samples that are Former smokers

Figure 18: P-value diagnostic plot for differental expression between samples that are Current smokers and samples that are Former smokers

We can see from the plots our results are highly relevant and statistically significant.

3.2.2 PML (Dysplasia)

Before we start, we have to exclude the metaplasia group, seeing as it was excluded in the paper and we need two groups only, dysplasia vs. no dysplasia.

se.filt <- se.filt[,se.filt$dysplasia!="metaplasia"]
se.filt$dysplasia <- factor(se.filt$dysplasia, levels = c("dysplasia", "no dysplasia"))
dim(se.filt)
[1] 13167    74

First, we build our design matrix.

mod <- model.matrix(~ dysplasia, data=colData(se.filt))
head(mod)
           (Intercept) dysplasiano dysplasia
SRR3225255           1                     0
SRR3225256           1                     0
SRR3225257           1                     0
SRR3225258           1                     0
SRR3225259           1                     0
SRR3225260           1                     0

Next, we fit the linear regression model

fit <- lmFit(assays(se.filt)$logCPM, mod)

We calculate moderated t-statistics

fit <- eBayes(fit)

A quick overview of the results for FDR p-value cutoff at 5% (default)

res <- decideTests(fit)
summary(res)
       (Intercept) dysplasiano dysplasia
Down             0                  1499
NotSig           0                  9950
Up           13167                  1718

summary(res)[1,2] + summary(res)[3,2]
[1] 3217

We have 1499 downregulated genes, and 1718 upregulated genes, a total of 3217.

Seeing these results, we can afford to be a little more restricting with the FDR cut-off. We’ll fetch the results with a p-value cut-off of 1%.

res <- decideTests(fit, p.value=0.01)
summary(res)
       (Intercept) dysplasiano dysplasia
Down             0                   548
NotSig           0                 11850
Up           13167                   769

summary(res)[1,2] + summary(res)[3,2]
[1] 1317

We now have 548 downregulated genes, and 769 upregulated genes, a total of 1317.

And a more detailed view of the results (we’ll be sure to add more information about the genes, so as to not only have the EntrezID identifier as rowname).

The entire table, with information about all the genes, sorted by p-value:

genesmd <- data.frame(chr=as.character(seqnames(rowRanges(se.filt))),
                      symbol=rowData(se.filt)[,5],
                      stringsAsFactors=FALSE)
gene_id=rowData(se.filt)[,1]
fit$genes <- genesmd
  
tt_dysplasia <- topTable(fit, coef=2, n=Inf)
tt_dysplasia <- tt_dysplasia[order(tt_dysplasia$adj.P.Val),]
dim(tt_dysplasia)
[1] 13167     8
head(tt_dysplasia)
          chr     symbol      logFC  AveExpr         t      P.Value
26505       2      CNNM3  0.3315382 5.375215  5.865796 1.147186e-07
101927550   7 AC004893.2  0.4896133 1.522941  5.805219 1.471928e-07
23334       1       SZT2  0.4180050 5.864061  5.710929 2.165960e-07
7580       10      ZNF32 -0.2577731 3.601835 -5.622424 3.106504e-07
825        15      CAPN3  0.8951923 1.604204  5.570312 3.837861e-07
29123      16    ANKRD11  0.2243954 6.929926  5.541201 4.317646e-07
             adj.P.Val        B
26505     0.0008456886 7.407644
101927550 0.0008456886 7.177840
23334     0.0008456886 6.821753
7580      0.0008456886 6.489386
825       0.0008456886 6.294578
29123     0.0008456886 6.186050

But our goal is to get a list of DE genes, so, finally, we’ll fetch the genes called DE with FDR <1%

DEgenes_dysplasia <- rownames(tt_dysplasia)[tt_dysplasia$adj.P.Val < 0.01]
length(DEgenes_dysplasia)
[1] 1317

We’ll now plot the distribution of p-values and moderated t-statistics (Figure 19)

par(mfrow=c(1,2), mar=c(4, 5, 2, 2))
hist(tt_dysplasia$P.Value, xlab="Raw P-values", main="", las=1)
qqt(fit$t[, 2], df=fit$df.prior+fit$df.residual, main="", pch=".", cex=3)
abline(0, 1, lwd=2)
Distribution of p-values and moderated t-statistics on every gene between samples with PML vs. without PML.

Figure 19: Distribution of p-values and moderated t-statistics on every gene between samples with PML vs. without PML

Another relevant one is the volcano plot (Figure ??).

volcanoplot(fit, coef=2, highlight=7, names=fit$genes$symbol, las=1)
P-value diagnostic plot for differental expression between samples with Dysplasia and samples without Dysplasia

Figure 20: P-value diagnostic plot for differental expression between samples with Dysplasia and samples without Dysplasia

We can see from the plots our results are highly relevant and statistically significant.

4 Functional analysis

Now, let’s attempt to find DE pathways, as DE gene sets, by detecting the functional enrichment.

4.1 Gene Ontology

Firstly, we will attempt to do it with the Gene Ontology (GO) analysis which corresponds to applying the one-tailed Fisher’s exact test to every gene set in the GO database. To execute this, we will use the Bioconductor package GOstats.

The first step is to create our gene universe.

library(GOstats)
library(org.Hs.eg.db)

geneUniverse <- rownames(se.filt)

4.1.1 Functional Analysis for Smoking Status

Let’s build the parameter object and use the conditional test which computes the significance of a GO term conditional on the significance of its children.

paramsSmokerStatus <- new("GOHyperGParams", geneIds=DEgenes_smoker,    
                 universeGeneIds=geneUniverse,
                 annotation="org.Hs.eg.db", ontology="BP",
                 pvalueCutoff=0.05, testDirection="over")

conditional(paramsSmokerStatus) <- TRUE

The next step is to run the functional enrichment analysis.

hgOverCondSmokerStatus <- hyperGTest(paramsSmokerStatus)
hgOverCondSmokerStatus
Gene to GO BP Conditional test for over-representation 
7785 GO BP ids tested (339 have p < 0.05)
Selected gene set size: 1233 
    Gene universe size: 11592 
    Annotation package: org.Hs.eg 

And lastly to store and visualize the results. The resHgOverCond data.frame object contacts the results of the hypothesis test for each GO term, satisfying the P-value cutoff.

resHgOverCondSmokerStatus <- summary(hgOverCondSmokerStatus)
dim(resHgOverCondSmokerStatus)
[1] 339   7
head(resHgOverCondSmokerStatus)
      GOBPID       Pvalue OddsRatio   ExpCount Count Size
1 GO:0006890 1.236820e-06  4.935723   5.211957    18   49
2 GO:0006888 1.984599e-05  2.743511  12.232143    28  115
3 GO:0045454 6.446669e-05  4.588627   3.935559    13   37
4 GO:0097237 1.169632e-04  2.715806  10.104814    23   95
5 GO:0043436 2.293796e-04  1.525248  69.457298    98  653
6 GO:0042592 2.386685e-04  1.395609 121.789596   158 1145
                                                                   Term
1 retrograde vesicle-mediated transport, Golgi to endoplasmic reticulum
2             endoplasmic reticulum to Golgi vesicle-mediated transport
3                                                cell redox homeostasis
4                                  cellular response to toxic substance
5                                             oxoacid metabolic process
6                                                   homeostatic process

ifelse(!dir.exists(file.path("./results")),
        dir.create(file.path("./results")),
        "Directory Exists")
[1] "Directory Exists"

htmlReport(hgOverCondSmokerStatus, file="results/gotests_smoker_status.html")
browseURL("results/gotests_smoker_status.html")

Check out the result here.

We do not want our GO terms to be involving too few genes or too many genes. That would make them too unreliable. So let’s filter the previous results with a minimum size of 3, a maximum size of 300 and a minimum count of 3. Then we order them by the OddsRatio column.

resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[!is.infinite(resHgOverCondSmokerStatus$OddsRatio), ]
mask <- resHgOverCondSmokerStatus$Size >= 3 & resHgOverCondSmokerStatus$Size <= 300 & resHgOverCondSmokerStatus$Count >= 3
resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[mask, ]
dim(resHgOverCondSmokerStatus)
[1] 249   7

ord <- order(resHgOverCondSmokerStatus$OddsRatio, decreasing=TRUE)
resHgOverCondSmokerStatus <- resHgOverCondSmokerStatus[ord, ]
head(resHgOverCondSmokerStatus)
       GOBPID       Pvalue OddsRatio  ExpCount Count Size
13 GO:0015827 0.0005831661  33.71196 0.5318323     4    5
40 GO:0015829 0.0044208430  25.26341 0.4254658     3    4
41 GO:0018197 0.0044208430  25.26341 0.4254658     3    4
24 GO:0016115 0.0016016227  16.85435 0.6381988     4    6
11 GO:0044597 0.0005740280  14.05537 0.8509317     5    8
12 GO:0044598 0.0005740280  14.05537 0.8509317     5    8
                                  Term
13                tryptophan transport
40                    valine transport
41 peptidyl-aspartic acid modification
24         terpenoid catabolic process
11      daunorubicin metabolic process
12       doxorubicin metabolic process

Using the geneIdsByCategory() method, we extract the genes that enrich each GO term and paste it to the result.

geneIDs <- geneIdsByCategory(hgOverCondSmokerStatus)[resHgOverCondSmokerStatus$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresultsSmokerStatus <- cbind(resHgOverCondSmokerStatus, Genes=geneSYMs)
rownames(goresultsSmokerStatus) <- 1:nrow(goresultsSmokerStatus)

We generate an HTML page using the knitr and kableExtra packages.

See the HTML page by clicking on this link.

Table 1 displays the top 10 Go results (sorted by Odds Ratio).

Table 1: GO Enrichment Smoking Status
Pvalue OddsRatio ExpCount Count Size Term
1 0.0005832 33.71196 0.5318323 4 5 tryptophan transport
2 0.0044208 25.26341 0.4254658 3 4 valine transport
3 0.0044208 25.26341 0.4254658 3 4 peptidyl-aspartic acid modification
4 0.0016016 16.85435 0.6381988 4 6 terpenoid catabolic process
5 0.0005740 14.05537 0.8509317 5 8 daunorubicin metabolic process
6 0.0005740 14.05537 0.8509317 5 8 doxorubicin metabolic process
7 0.0101774 12.63049 0.5318323 3 5 sinoatrial node development
8 0.0101774 12.63049 0.5318323 3 5 taurine transport
9 0.0101774 12.63049 0.5318323 3 5 isoleucine transport
10 0.0101774 12.63049 0.5318323 3 5 thyroid hormone transport

4.1.2 Functional Analysis for PML (Dysplasia)

Build the parameter object and use the conditional test.

paramsDysplasia <- new("GOHyperGParams", geneIds=DEgenes_dysplasia,    
                 universeGeneIds=geneUniverse,
                 annotation="org.Hs.eg.db", ontology="BP",
                 pvalueCutoff=0.05, testDirection="over")

conditional(paramsDysplasia) <- TRUE

Run the functional enrichment analysis.

hgOverCondDysplasia <- hyperGTest(paramsDysplasia)
hgOverCondDysplasia
Gene to GO BP Conditional test for over-representation 
6454 GO BP ids tested (157 have p < 0.05)
Selected gene set size: 1136 
    Gene universe size: 11592 
    Annotation package: org.Hs.eg 

Store and visualize the results.

resHgOverCondDysplasia <- summary(hgOverCondDysplasia)
dim(resHgOverCondDysplasia)
[1] 157   7
head(resHgOverCondDysplasia)
      GOBPID       Pvalue OddsRatio  ExpCount Count Size
1 GO:0002181 8.862844e-50 21.497819  10.52792    74  108
2 GO:0043604 2.182958e-29  3.195909  68.79503   167  702
3 GO:0006518 5.111718e-27  3.018487  71.24500   166  727
4 GO:0034641 2.890271e-18  1.748412 419.50341   553 4328
5 GO:1901566 3.312695e-16  2.076009 115.41480   200 1195
6 GO:0022904 6.184984e-13  5.134895  10.58385    38  108
                                          Term
1                      cytoplasmic translation
2                   amide biosynthetic process
3                    peptide metabolic process
4 cellular nitrogen compound metabolic process
5 organonitrogen compound biosynthetic process
6         respiratory electron transport chain

htmlReport(hgOverCondDysplasia, file="./results/gotests_dysplasia.html")
browseURL("results/gotests_dysplasia.html")

Check out the result here.

Filter the previous results with a minimum size of 3, a maximum size of 300 and a minimum count of 3 and order them by the OddsRatio column.

resHgOverCondDysplasia <- resHgOverCondDysplasia[!is.infinite(resHgOverCondDysplasia$OddsRatio), ]
mask <- resHgOverCondDysplasia$Size >= 3 & resHgOverCondDysplasia$Size <= 300 & resHgOverCondDysplasia$Count >= 3
resHgOverCondDysplasia <- resHgOverCondDysplasia[mask, ]
dim(resHgOverCondDysplasia)
[1] 107   7

ord <- order(resHgOverCondDysplasia$OddsRatio, decreasing=TRUE)
resHgOverCondDysplasia <- resHgOverCondDysplasia[ord, ]
head(resHgOverCondDysplasia)
       GOBPID       Pvalue OddsRatio   ExpCount Count Size
30 GO:0036261 2.058449e-05  27.75398  0.7839890     6    8
61 GO:2000435 3.480267e-03  27.68314  0.3919945     3    4
1  GO:0002181 8.862844e-50  21.49782 10.5279197    74  108
13 GO:0006123 5.424236e-08  17.02963  1.6659765    11   17
73 GO:2000627 8.066023e-03  13.84025  0.4899931     3    5
57 GO:0015986 2.357034e-03  12.55696  0.6739526     4    7
                                                       Term
30                   7-methylguanosine cap hypermethylation
61               negative regulation of protein neddylation
1                                   cytoplasmic translation
13 mitochondrial electron transport, cytochrome c to oxygen
73           positive regulation of miRNA catabolic process
57                 proton motive force-driven ATP synthesis

Using the geneIdsByCategory() method, we extract the genes that enrich each GO term and paste it to the result.

geneIDs <- geneIdsByCategory(hgOverCondDysplasia)[resHgOverCondDysplasia$GOBPID]
geneSYMs <- sapply(geneIDs, function(id) select(org.Hs.eg.db, columns="SYMBOL", key=id, keytype="ENTREZID")$SYMBOL)
geneSYMs <- sapply(geneSYMs, paste, collapse=", ")
goresultsDysplasia <- cbind(resHgOverCondDysplasia, Genes=geneSYMs)
rownames(goresultsDysplasia) <- 1:nrow(goresultsDysplasia)

We generate an HTML page using the knitr and kableExtra packages.

See the HTML page by clicking on this link.

Table 2 displays the top 10 Go results (sorted by Odds Ratio).

Table 2: GO Enrichment Dysplasia
Pvalue OddsRatio ExpCount Count Size Term
1 0.0000206 27.753982 0.7839890 6 8 7-methylguanosine cap hypermethylation
2 0.0034803 27.683142 0.3919945 3 4 negative regulation of protein neddylation
3 0.0000000 21.497819 10.5279197 74 108 cytoplasmic translation
4 0.0000001 17.029630 1.6659765 11 17 mitochondrial electron transport, cytochrome c to oxygen
5 0.0080660 13.840247 0.4899931 3 5 positive regulation of miRNA catabolic process
6 0.0023570 12.556962 0.6739526 4 7 proton motive force-driven ATP synthesis
7 0.0025191 12.312132 0.6859903 4 7 protein maturation by [4Fe-4S] cluster transfer
8 0.0000049 11.920522 1.5679779 9 16 ribosomal small subunit assembly
9 0.0000860 10.798642 1.2739821 7 13 mitochondrial electron transport, ubiquinol to cytochrome c
10 0.0000000 9.430127 3.4969644 18 36 ribosomal small subunit biogenesis

4.2 Gene Set Enrichment Analysis

The second part of the functional analysis involves performing Gene Set Enrichment Analysis (GSEA), a computational method used to determine if predefined sets of genes show statistically significant and concordant differences between two biological states (e.g., phenotypes). In our study, we will conduct GSEA for two comparisons: identifying pathways significantly enriched between current and former smokers and between the presence and absence of premalignant lesions (PMLs).

4.2.1 GSEA for Smoking status

We load the necessary libraries: fgsea for GSEA, GSVAdata for gene set data, and GSEABase for gene set collection handling. We load the c2BroadSets dataset, which contains gene sets from KEGG, REACTOME, and BIOCARTA pathways.

library(fgsea)
library(GSVAdata)
Loading required package: GSEABase
Loading required package: hgu95a.db

library(GSEABase)
data(c2BroadSets)

c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
  +                              grep("^REACTOME", names(c2BroadSets)),
  +                              grep("^BIOCARTA", names(c2BroadSets)))]
c2BroadSets
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

In order to do the GSEA anlysis we decided to compare our dataset with two more datasets from the MSigDB. One dataset represents the control group and the other the tumor group. We use the msigdbr library to access the Molecular Signatures Database (MSigDB) for the datasets. The dplyr library is used for data manipulation.


library(msigdbr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:GSEABase':

    intersect, setdiff, union
The following object is masked from 'package:graph':

    union
The following object is masked from 'package:AnnotationDbi':

    select
The following object is masked from 'package:Biobase':

    combine
The following objects are masked from 'package:GenomicRanges':

    intersect, setdiff, union
The following object is masked from 'package:GenomeInfoDb':

    intersect
The following objects are masked from 'package:IRanges':

    collapse, desc, intersect, setdiff, slice, union
The following objects are masked from 'package:S4Vectors':

    first, intersect, rename, setdiff, setequal, union
The following objects are masked from 'package:BiocGenerics':

    combine, intersect, setdiff, union
The following object is masked from 'package:matrixStats':

    count
The following object is masked from 'package:kableExtra':

    group_rows
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
msigdb_data <- msigdbr(species = "Homo sapiens")
str(msigdb_data)
tibble [4,331,807 × 15] (S3: tbl_df/tbl/data.frame)
 $ gs_cat            : chr [1:4331807] "C3" "C3" "C3" "C3" ...
 $ gs_subcat         : chr [1:4331807] "MIR:MIR_Legacy" "MIR:MIR_Legacy" "MIR:MIR_Legacy" "MIR:MIR_Legacy" ...
 $ gs_name           : chr [1:4331807] "AAACCAC_MIR140" "AAACCAC_MIR140" "AAACCAC_MIR140" "AAACCAC_MIR140" ...
 $ gene_symbol       : chr [1:4331807] "ABCC4" "ABRAXAS2" "ACTN4" "ACTN4" ...
 $ entrez_gene       : int [1:4331807] 10257 23172 81 81 90 8754 8754 11096 219287 287 ...
 $ ensembl_gene      : chr [1:4331807] "ENSG00000125257" "ENSG00000165660" "ENSG00000130402" "ENSG00000282844" ...
 $ human_gene_symbol : chr [1:4331807] "ABCC4" "ABRAXAS2" "ACTN4" "ACTN4" ...
 $ human_entrez_gene : int [1:4331807] 10257 23172 81 81 90 8754 8754 11096 219287 287 ...
 $ human_ensembl_gene: chr [1:4331807] "ENSG00000125257" "ENSG00000165660" "ENSG00000130402" "ENSG00000282844" ...
 $ gs_id             : chr [1:4331807] "M12609" "M12609" "M12609" "M12609" ...
 $ gs_pmid           : chr [1:4331807] "" "" "" "" ...
 $ gs_geoid          : chr [1:4331807] "" "" "" "" ...
 $ gs_exact_source   : chr [1:4331807] "" "" "" "" ...
 $ gs_url            : chr [1:4331807] "" "" "" "" ...
 $ gs_description    : chr [1:4331807] "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ "Genes having at least one occurence of the motif AAACCAC in their 3' untranslated region. The motif represents "| __truncated__ ...
control <- msigdb_data %>% filter(gs_name == "DESCARTES_FETAL_LUNG_BRONCHIOLAR_AND_ALVEOLAR_EPITHELIAL_CELLS")
tumor <- msigdb_data %>% filter(gs_name == "DING_LUNG_CANCER_MUTATED_FREQUENTLY")

We create a custom gene set using the Entrez IDs of differentially expressed genes in smoking status. We then combine this custom gene set with the previously filtered gene sets into a single Gene Set Collection (gsc).

library(GSEABase)
control_genes <- as.character(unique(control$entrez_gene))
control_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=control_genes, setName="CONTROL")
details(control_GS)
setName: CONTROL 
geneIds: 55289, 114, ..., 7364 (total: 70)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null 
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:10 2024:1
description: 
organism: 
pubMedIds: 
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

tumor_genes <- as.character(unique(tumor$entrez_gene))
tumor_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=tumor_genes, setName="TUMOR")
details(tumor_GS)
setName: TUMOR 
geneIds: 91, 324, ..., 7157 (total: 12)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null 
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:10 2024:2
description: 
organism: 
pubMedIds: 
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

gsc_smoker <- GeneSetCollection(c(c2BroadSets, control_GS, tumor_GS))
gsets_smoker <- geneIds(gsc_smoker)

We prepare the data by converting the histology information into factors and then into integers.

histology_factors_smoker <- factor(se.filt$smoker)
histology_integers_smoker <- as.integer(histology_factors_smoker)

levels(histology_factors_smoker)
[1] "Current smoker" "Former smoker" 
histology_integers_smoker
 [1] 2 1 1 1 2 1 1 1 2 1 2 2 2 1 1 2 1 1 2 2 2 2 1 1 2 2 2 2 2 2 1 1 2 2 1 2 1 1
[39] 2 2 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 2 1 1 2 1 1 2 1 2 2 1 1 2 1 1 2 2

We use the fgseaLabel function to perform GSEA by permuting phenotypes 10,000 times to compute enrichment scores and p-values for each gene set.

gseapermpheno_smoker <- fgseaLabel(gsets_smoker, assays(se.filt)$logCPM, histology_integers_smoker,10000, minSize=5, maxSize=300)
head(gseapermpheno_smoker[order(gseapermpheno_smoker$pval), ])
                                                                                                   pathway
                                                                                                    <char>
1:                                                                          KEGG_VIBRIO_CHOLERAE_INFECTION
2:                                                                                     KEGG_PROTEIN_EXPORT
3:                                                                          KEGG_PENTOSE_PHOSPHATE_PATHWAY
4: REACTOME_NEF_MEDIATES_DOWN_MODULATION_OF_CELL_SURFACE_RECEPTORS_BY_RECRUITING_THEM_TO_CLATHRIN_ADAPTERS
5:                                                                             KEGG_GLUTATHIONE_METABOLISM
6:                                                                REACTOME_HOMOLOGOUS_RECOMBINATION_REPAIR
          pval     padj         ES       NES nMoreExtreme  size  leadingEdge
         <num>    <num>      <num>     <num>        <int> <int>       <list>
1: 0.006082009 0.998112 -0.5863669 -1.713281           30    47 10945, 5....
2: 0.006632852 0.998112 -0.7624549 -1.687242           33    23 28972, 9....
3: 0.008277493 0.998112 -0.6475517 -1.711022           41    21 6888, 70....
4: 0.013095238 0.998112 -0.6365695 -1.687739           65    18 1174, 51....
5: 0.013299433 0.998112 -0.6190362 -1.651290           67    40 2936, 28....
6: 0.019032822 0.998112  0.5912820  1.600837           97    14 3978, 96....

We prepare a ranked list of genes based on their differential expression statistics (tt_smoker$t). We then use the fgsea function to perform GSEA on this ranked list.

stats_smoker <- tt_smoker$t
names(stats_smoker) <- rownames(tt_smoker)
gseapermgsets_smoker <- fgsea(gsets_smoker, stats_smoker, minSize=5, maxSize=300)
head(gseapermgsets_smoker[order(gseapermgsets_smoker$pval), ])
                                                            pathway
                                                             <char>
1:                         REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
2:                                   KEGG_OXIDATIVE_PHOSPHORYLATION
3:                                             REACTOME_TRANSLATION
4: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
5:                 REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
6:                                          KEGG_PARKINSONS_DISEASE
           pval         padj  log2err         ES       NES  size  leadingEdge
          <num>        <num>    <num>      <num>     <num> <int>       <list>
1: 1.614771e-17 1.312809e-14 1.076868 -0.6042764 -2.832105   114 28972, 9....
2: 3.631782e-17 1.476319e-14 1.067210 -0.6002303 -2.813142   114 51606, 9....
3: 3.286682e-16 8.906909e-14 1.037696 -0.5959089 -2.785690   112 8667, 86....
4: 9.326339e-16 1.895578e-13 1.017545 -0.6093776 -2.811094    99 8667, 86....
5: 2.770603e-15 4.505000e-13 1.007318 -0.5430190 -2.627287   140 6513, 22....
6: 3.550234e-15 4.810566e-13 1.007318 -0.5804702 -2.710117   111 7345, 93....
gseapermpheno_smoker[gseapermpheno_smoker$padj < 0.01, ]
Empty data.table (0 rows and 8 cols): pathway,pval,padj,ES,NES,nMoreExtreme...
gseapermgsets_smoker[gseapermgsets_smoker$padj < 0.01, ]
                                                                                                    pathway
                                                                                                     <char>
 1:                                                                                    BIOCARTA_IL7_PATHWAY
 2:                                                                             BIOCARTA_PROTEASOME_PATHWAY
 3:                                                                             BIOCARTA_SALMONELLA_PATHWAY
 4:                                                                                 KEGG_ALZHEIMERS_DISEASE
 5:                                                        KEGG_AMINO_SUGAR_AND_NUCLEOTIDE_SUGAR_METABOLISM
 6:                                                             KEGG_CYTOKINE_CYTOKINE_RECEPTOR_INTERACTION
 7:                                                                             KEGG_GLUTATHIONE_METABOLISM
 8:                                                                                KEGG_HUNTINGTONS_DISEASE
 9:                                                                         KEGG_JAK_STAT_SIGNALING_PATHWAY
10:                                                                                           KEGG_LYSOSOME
11:                                                       KEGG_METABOLISM_OF_XENOBIOTICS_BY_CYTOCHROME_P450
12:                                                                              KEGG_N_GLYCAN_BIOSYNTHESIS
13:                                                                          KEGG_OXIDATIVE_PHOSPHORYLATION
14:                                                                                 KEGG_PARKINSONS_DISEASE
15:                                                                          KEGG_PENTOSE_PHOSPHATE_PATHWAY
16:                                                                                         KEGG_PROTEASOME
17:                                                                                     KEGG_PROTEIN_EXPORT
18:                                                                                           KEGG_RIBOSOME
19:                                                          KEGG_SNARE_INTERACTIONS_IN_VESICULAR_TRANSPORT
20:                                                                          KEGG_VIBRIO_CHOLERAE_INFECTION
21:                                                            REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC
22:                                              REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX
23:                                                               REACTOME_CLATHRIN_DERIVED_VESICLE_BUDDING
24:                                             REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_
25:                                                                       REACTOME_ELECTRON_TRANSPORT_CHAIN
26:                                                      REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
27:                                                       REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS
28:                                                      REACTOME_FORMATION_OF_THE_EARLY_ELONGATION_COMPLEX
29:                              REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
30:                                                       REACTOME_GENERATION_OF_SECOND_MESSENGER_MOLECULES
31:                                                        REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
32:                                                                        REACTOME_GLUTATHIONE_CONJUGATION
33:                                                            REACTOME_GOLGI_ASSOCIATED_VESICLE_BIOGENESIS
34:                                        REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
35:                                                                  REACTOME_HIV1_TRANSCRIPTION_ELONGATION
36:                                                                                  REACTOME_HIV_INFECTION
37:                                                                           REACTOME_INFLUENZA_LIFE_CYCLE
38:                                              REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
39:                                                                REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
40:                                                               REACTOME_INTEGRATION_OF_ENERGY_METABOLISM
41:                                                                           REACTOME_MEMBRANE_TRAFFICKING
42:                                                                      REACTOME_METABOLISM_OF_AMINO_ACIDS
43:                                                                         REACTOME_METABOLISM_OF_PROTEINS
44:                                                                                REACTOME_M_G1_TRANSITION
45: REACTOME_NEF_MEDIATES_DOWN_MODULATION_OF_CELL_SURFACE_RECEPTORS_BY_RECRUITING_THEM_TO_CLATHRIN_ADAPTERS
46:                                                            REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE
47:                                                                       REACTOME_PEPTIDE_CHAIN_ELONGATION
48:                                                                           REACTOME_PHASE_II_CONJUGATION
49:                                   REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE
50:                                                            REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT
51:                                                    REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS
52:                                                                REACTOME_REGULATION_OF_INSULIN_SECRETION
53:                                                          REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE
54:                                                     REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1
55:                                                       REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21
56:                                                                               REACTOME_SIGNALING_BY_WNT
57:                                                                           REACTOME_STABILIZATION_OF_P53
58:                                   REACTOME_THE_ROLE_OF_NEF_IN_HIV1_REPLICATION_AND_DISEASE_PATHOGENESIS
59:                                                                REACTOME_TRANSCRIPTION_OF_THE_HIV_GENOME
60:                                                                                    REACTOME_TRANSLATION
61:                                                       REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION
62:                                                           REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G
63:                                                                         REACTOME_VIRAL_MRNA_TRANSLATION
                                                                                                    pathway
            pval         padj   log2err         ES       NES  size  leadingEdge
           <num>        <num>     <num>      <num>     <num> <int>       <list>
 1: 4.299426e-04 6.020500e-03 0.4984931  0.6724408  2.072147    16 3574, 20....
 2: 5.063540e-04 6.639771e-03 0.4772708 -0.6447528 -2.037740    19 6185, 56....
 3: 4.507973e-04 6.108303e-03 0.4984931 -0.7759773 -2.012846    10 10094, 1....
 4: 2.468758e-11 1.338067e-09 0.8634154 -0.4962823 -2.396818   139 2597, 11....
 5: 4.458925e-05 8.631205e-04 0.5573322 -0.5483911 -2.063953    38 2582, 27....
 6: 3.365383e-04 5.261647e-03 0.4984931  0.3679708  1.800944   106 3566, 63....
 7: 1.648727e-07 5.585063e-06 0.6901325 -0.6339983 -2.409353    40 2877, 29....
 8: 1.193424e-10 5.707375e-09 0.8266573 -0.4660142 -2.277193   157 148327, ....
 9: 4.051522e-06 9.687904e-05 0.6105269  0.4467455  2.085570    88 3566, 85....
10: 4.303598e-06 9.996643e-05 0.6105269 -0.4339797 -2.028721   112 3423, 47....
11: 1.495416e-07 5.526242e-06 0.6901325 -0.6461607 -2.401876    36 218, 164....
12: 3.573577e-04 5.319976e-03 0.4984931 -0.4985627 -1.942198    44 8704, 19....
13: 3.631782e-17 1.476319e-14 1.0672100 -0.6002303 -2.813142   114 51606, 9....
14: 3.550234e-15 4.810566e-13 1.0073180 -0.5804702 -2.710117   111 7345, 93....
15: 1.522145e-04 2.690226e-03 0.5188481 -0.6503176 -2.107616    21 6888, 70....
16: 1.677839e-06 4.400267e-05 0.6435518 -0.6010822 -2.298384    41 5717, 57....
17: 3.095880e-08 1.324711e-06 0.7195128 -0.7634091 -2.534039    23 28972, 9....
18: 6.290674e-15 7.306169e-13 0.9969862 -0.6367641 -2.783399    80 28998, 5....
19: 4.036133e-04 5.859600e-03 0.4984931 -0.5238744 -1.947320    36 10652, 6....
20: 5.677129e-07 1.709447e-05 0.6594444 -0.5811042 -2.294809    47 10945, 1....
21: 3.301747e-05 6.710800e-04 0.5573322 -0.5041317 -2.065682    56 5717, 73....
22: 3.553942e-04 5.319976e-03 0.4984931 -0.4775321 -1.914708    50 5717, 73....
23: 3.686913e-05 7.310879e-04 0.5573322 -0.4916843 -2.056021    59 2512, 24....
24: 5.558169e-05 1.050882e-03 0.5573322 -0.4952473 -2.031736    55 5717, 65....
25: 2.798918e-13 1.896267e-11 0.9325952 -0.6329125 -2.720252    71 9377, 47....
26: 3.169378e-04 5.153408e-03 0.4984931 -0.6999149 -2.040987    15 515, 514....
27: 3.497068e-14 2.584651e-12 0.9653278 -0.6129844 -2.750971    88 8667, 86....
28: 4.108636e-04 5.860213e-03 0.4984931 -0.5623485 -1.996062    29 22916, 5....
29: 3.201969e-06 7.888486e-05 0.6272567 -0.5825199 -2.269262    44 8667, 86....
30: 1.122466e-04 2.074011e-03 0.5384341  0.6091808  2.161551    26 5058, 51....
31: 2.770603e-15 4.505000e-13 1.0073180 -0.5430190 -2.627287   140 6513, 22....
32: 3.264406e-04 5.203847e-03 0.4984931 -0.7159891 -2.062425    14 4257, 27....
33: 1.713046e-04 2.963206e-03 0.5188481 -0.4930281 -1.992204    52 2512, 24....
34: 9.326339e-16 1.895578e-13 1.0175448 -0.6093776 -2.811094    99 8667, 86....
35: 1.506294e-04 2.690226e-03 0.5188481 -0.5226089 -1.966918    38 22916, 5....
36: 5.597472e-06 1.264096e-04 0.6105269 -0.3708108 -1.841777   176 1174, 51....
37: 6.691038e-06 1.431530e-04 0.6105269 -0.4085555 -1.961145   129 5436, 56....
38: 7.265374e-15 7.383437e-13 0.9865463 -0.6145290 -2.773292    92 5436, 56....
39: 1.614771e-17 1.312809e-14 1.0768682 -0.6042764 -2.832105   114 28972, 9....
40: 3.676574e-13 2.299273e-11 0.9325952 -0.4680134 -2.356694   191 6888, 70....
41: 1.569938e-07 5.549391e-06 0.6901325 -0.5164652 -2.243162    76 2512, 22....
42: 5.962737e-08 2.423853e-06 0.7049757 -0.4542933 -2.168565   125 1728, 23....
43: 3.408285e-11 1.731835e-09 0.8513391 -0.4514737 -2.276909   194 8667, 54....
44: 4.369121e-04 6.020500e-03 0.4984931 -0.4562860 -1.889273    57 5717, 73....
45: 4.648411e-04 6.195341e-03 0.4984931 -0.6477526 -2.016798    18 1174, 51....
46: 6.007326e-07 1.744270e-05 0.6594444 -0.5975739 -2.303853    42 5717, 73....
47: 2.387782e-14 1.941267e-12 0.9759947 -0.6354136 -2.773331    78 6134, 90....
48: 2.129517e-04 3.533259e-03 0.5188481 -0.5850268 -2.086252    30 4257, 27....
49: 3.599000e-04 5.319976e-03 0.4984931 -0.4277824 -1.835476    69 5717, 65....
50: 7.109836e-08 2.752522e-06 0.7049757 -0.5069179 -2.260858    87 3171, 61....
51: 2.970462e-08 1.324711e-06 0.7337620 -0.5272849 -2.319129    83 3171, 61....
52: 5.999577e-12 3.484040e-10 0.8870750 -0.4812404 -2.384412   171 6513, 22....
53: 2.548378e-07 8.287327e-06 0.6749629 -0.6033810 -2.354751    46 1728, 57....
54: 2.199731e-06 5.588691e-05 0.6272567 -0.5726409 -2.260105    48 5717, 65....
55: 2.075988e-05 4.327636e-04 0.5756103 -0.5266596 -2.111689    50 5717, 65....
56: 6.109282e-06 1.342391e-04 0.6105269 -0.5196133 -2.156391    58 5528, 57....
57: 7.424768e-07 2.081495e-05 0.6594444 -0.5914674 -2.308257    46 5717, 73....
58: 1.821620e-04 3.085369e-03 0.5188481 -0.6221308 -2.091324    24 1174, 51....
59: 6.603218e-04 8.521296e-03 0.4772708 -0.4505919 -1.846302    56 6884, 68....
60: 3.286682e-16 8.906909e-14 1.0376962 -0.5959089 -2.785690   112 8667, 86....
61: 3.872432e-07 1.210880e-05 0.6749629 -0.5771814 -2.322203    51 8667, 86....
62: 8.305533e-07 2.250800e-05 0.6594444 -0.5760275 -2.274761    47 5717, 73....
63: 2.348315e-14 1.941267e-12 0.9759947 -0.6356636 -2.774422    78 5611, 61....
            pval         padj   log2err         ES       NES  size  leadingEdge
significant_pathways_gsets_smoker <- gseapermgsets_smoker[gseapermgsets_smoker$padj < 0.01, ]
significant_pathways_gsets_ordered_smoker <- significant_pathways_gsets_smoker[order(significant_pathways_gsets_smoker$padj), ]
head(significant_pathways_gsets_ordered_smoker, n = 10)
                                                             pathway
                                                              <char>
 1:                         REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
 2:                                   KEGG_OXIDATIVE_PHOSPHORYLATION
 3:                                             REACTOME_TRANSLATION
 4: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
 5:                 REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
 6:                                          KEGG_PARKINSONS_DISEASE
 7:                                                    KEGG_RIBOSOME
 8:       REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
 9:                                REACTOME_PEPTIDE_CHAIN_ELONGATION
10:                                  REACTOME_VIRAL_MRNA_TRANSLATION
            pval         padj   log2err         ES       NES  size  leadingEdge
           <num>        <num>     <num>      <num>     <num> <int>       <list>
 1: 1.614771e-17 1.312809e-14 1.0768682 -0.6042764 -2.832105   114 28972, 9....
 2: 3.631782e-17 1.476319e-14 1.0672100 -0.6002303 -2.813142   114 51606, 9....
 3: 3.286682e-16 8.906909e-14 1.0376962 -0.5959089 -2.785690   112 8667, 86....
 4: 9.326339e-16 1.895578e-13 1.0175448 -0.6093776 -2.811094    99 8667, 86....
 5: 2.770603e-15 4.505000e-13 1.0073180 -0.5430190 -2.627287   140 6513, 22....
 6: 3.550234e-15 4.810566e-13 1.0073180 -0.5804702 -2.710117   111 7345, 93....
 7: 6.290674e-15 7.306169e-13 0.9969862 -0.6367641 -2.783399    80 28998, 5....
 8: 7.265374e-15 7.383437e-13 0.9865463 -0.6145290 -2.773292    92 5436, 56....
 9: 2.387782e-14 1.941267e-12 0.9759947 -0.6354136 -2.773331    78 6134, 90....
10: 2.348315e-14 1.941267e-12 0.9759947 -0.6356636 -2.774422    78 5611, 61....

We create histograms of p-values from both GSEA methods (permuting phenotypes and permuting gene sets) to visualize the distribution of p-values. We also generate an enrichment plot for the custom gene set DE_FC_SMOKER to show the enrichment score across the ranked gene list.

par(mfrow=c(1, 2))
hist(gseapermpheno_smoker$pval, xlab="P-value", main="Permuting phenotypes", las=1)
hist(gseapermgsets_smoker$pval, xlab="P-value", main="Permuting gene sets", las=1)

plotEnrichment(gsets_smoker$CONTROL, stats_smoker)

plotEnrichment(gsets_smoker$TUMOR, stats_smoker)

We use this part of the code to identify pathways that are significantly enriched (adjusted p-value < 0.01). The plotGseaTable function then generates a table and plot of GSEA results for these significant pathways, showing the enrichment scores and p-values. This helps us visualize and interpret the most relevant pathways in our smoking status analysis.

DEpwys <- gseapermgsets_smoker$pathway[gseapermgsets_smoker$padj < 0.01]
plotGseaTable(gsets_smoker[DEpwys], stats_smoker, gseapermgsets_smoker, gseaParam=0.5)

We generate an HTML page so we can visualize more easily.

library(kableExtra)
gsea_tab_smoker <- kable(significant_pathways_gsets_ordered_smoker, "html", caption="GSEA Results Smoker")
gsea_tab_smoker <- kable_styling(gsea_tab_smoker, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
save_kable(gsea_tab_smoker, file="./results/gsea_results_smoker.html", self_contained=TRUE)

See the HTML page by clicking on this link.

We use the limma::camera function to perform an additional gene set analysis that adjusts for inter-gene correlation within gene sets. This step enhances the robustness of our findings by providing an alternative method to validate the significant pathways identified by GSEA. The camera function uses the same gene sets and design matrix as our previous analyses.

gsetidx_smoker <- ids2indices(geneIds(gsc_smoker), rownames(se.filt))
camerares_smoker <- camera(y=assays(se.filt)$logCPM, index=gsetidx_smoker, design=mod, contrast=2)
head(camerares_smoker, 5)
                                                           NGenes Direction
KEGG_RIBOSOME                                                  80      Down
REACTOME_PEPTIDE_CHAIN_ELONGATION                              78      Down
REACTOME_VIRAL_MRNA_TRANSLATION                                78      Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS              88      Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION     92      Down
                                                                 PValue
KEGG_RIBOSOME                                              2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION                          3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION                            1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS          1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
                                                                    FDR
KEGG_RIBOSOME                                              1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION                          1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION                            3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS          3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36

We filter the Camera results to focus on pathways with an FDR < 0.01, ensuring we only consider the most significant pathways. This step helps us further refine our list of relevant pathways.

camerares_smoker[camerares_smoker$FDR < 0.01, ]
                                                                           NGenes
KEGG_RIBOSOME                                                                  80
REACTOME_PEPTIDE_CHAIN_ELONGATION                                              78
REACTOME_VIRAL_MRNA_TRANSLATION                                                78
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                              88
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                     92
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                           83
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT               99
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                      114
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                                   87
REACTOME_TRANSLATION                                                          112
REACTOME_INFLUENZA_LIFE_CYCLE                                                 129
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX     44
REACTOME_DIABETES_PATHWAYS                                                    312
REACTOME_METABOLISM_OF_PROTEINS                                               194
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                              51
KEGG_PARKINSONS_DISEASE                                                       111
KEGG_OXIDATIVE_PHOSPHORYLATION                                                114
REACTOME_ELECTRON_TRANSPORT_CHAIN                                              71
KEGG_ALZHEIMERS_DISEASE                                                       139
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                              140
KEGG_HUNTINGTONS_DISEASE                                                      157
REACTOME_REGULATION_OF_INSULIN_SECRETION                                      171
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                                  47
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                             15
KEGG_PROTEASOME                                                                41
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                                   42
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                                 46
REACTOME_STABILIZATION_OF_P53                                                  46
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                    55
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                            48
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                     191
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                              50
REACTOME_SIGNALING_BY_WNT                                                      58
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC                  21
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                    61
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                     50
REACTOME_GENE_EXPRESSION                                                      397
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                                   56
REACTOME_M_G1_TRANSITION                                                       57
REACTOME_G1_S_TRANSITION                                                       88
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE          69
KEGG_PROTEIN_EXPORT                                                            23
BIOCARTA_ETC_PATHWAY                                                           12
REACTOME_DNA_REPLICATION_PRE_INITIATION                                        68
BIOCARTA_RAB_PATHWAY                                                           10
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                           61
REACTOME_METABOLISM_OF_AMINO_ACIDS                                            125
REACTOME_CELL_CYCLE_CHECKPOINTS                                                97
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC                15
REACTOME_S_PHASE                                                               93
REACTOME_HIV_INFECTION                                                        176
REACTOME_SYNTHESIS_OF_DNA                                                      82
BIOCARTA_PROTEASOME_PATHWAY                                                    19
KEGG_CARDIAC_MUSCLE_CONTRACTION                                                46
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                     116
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                           40
BIOCARTA_SALMONELLA_PATHWAY                                                    10
BIOCARTA_CDC42RAC_PATHWAY                                                      14
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                              19
REACTOME_MITOTIC_M_M_G1_PHASES                                                137
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                         12
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                         36
KEGG_CITRATE_CYCLE_TCA_CYCLE                                                   27
                                                                           Direction
KEGG_RIBOSOME                                                                   Down
REACTOME_PEPTIDE_CHAIN_ELONGATION                                               Down
REACTOME_VIRAL_MRNA_TRANSLATION                                                 Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                               Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                      Down
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                            Down
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT                Down
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                        Down
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                                    Down
REACTOME_TRANSLATION                                                            Down
REACTOME_INFLUENZA_LIFE_CYCLE                                                   Down
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX      Down
REACTOME_DIABETES_PATHWAYS                                                      Down
REACTOME_METABOLISM_OF_PROTEINS                                                 Down
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                               Down
KEGG_PARKINSONS_DISEASE                                                         Down
KEGG_OXIDATIVE_PHOSPHORYLATION                                                  Down
REACTOME_ELECTRON_TRANSPORT_CHAIN                                               Down
KEGG_ALZHEIMERS_DISEASE                                                         Down
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                                Down
KEGG_HUNTINGTONS_DISEASE                                                        Down
REACTOME_REGULATION_OF_INSULIN_SECRETION                                        Down
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                                   Down
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                              Down
KEGG_PROTEASOME                                                                 Down
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                                    Down
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                                  Down
REACTOME_STABILIZATION_OF_P53                                                   Down
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                     Down
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                             Down
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                       Down
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                               Down
REACTOME_SIGNALING_BY_WNT                                                       Down
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC                   Down
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                     Down
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                      Down
REACTOME_GENE_EXPRESSION                                                        Down
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                                    Down
REACTOME_M_G1_TRANSITION                                                        Down
REACTOME_G1_S_TRANSITION                                                        Down
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE           Down
KEGG_PROTEIN_EXPORT                                                             Down
BIOCARTA_ETC_PATHWAY                                                            Down
REACTOME_DNA_REPLICATION_PRE_INITIATION                                         Down
BIOCARTA_RAB_PATHWAY                                                            Down
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                            Down
REACTOME_METABOLISM_OF_AMINO_ACIDS                                              Down
REACTOME_CELL_CYCLE_CHECKPOINTS                                                 Down
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC                 Down
REACTOME_S_PHASE                                                                Down
REACTOME_HIV_INFECTION                                                          Down
REACTOME_SYNTHESIS_OF_DNA                                                       Down
BIOCARTA_PROTEASOME_PATHWAY                                                     Down
KEGG_CARDIAC_MUSCLE_CONTRACTION                                                 Down
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                       Down
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                            Down
BIOCARTA_SALMONELLA_PATHWAY                                                     Down
BIOCARTA_CDC42RAC_PATHWAY                                                       Down
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                               Down
REACTOME_MITOTIC_M_M_G1_PHASES                                                  Down
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                          Down
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                          Down
KEGG_CITRATE_CYCLE_TCA_CYCLE                                                    Down
                                                                                 PValue
KEGG_RIBOSOME                                                              2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION                                          3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION                                            1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                          1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                 2.731443e-38
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                       1.145772e-35
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT           1.225223e-35
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                   1.019429e-33
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                               2.283164e-33
REACTOME_TRANSLATION                                                       1.171285e-32
REACTOME_INFLUENZA_LIFE_CYCLE                                              3.033430e-22
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 5.392902e-22
REACTOME_DIABETES_PATHWAYS                                                 8.692926e-22
REACTOME_METABOLISM_OF_PROTEINS                                            5.349591e-20
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                          9.395454e-20
KEGG_PARKINSONS_DISEASE                                                    1.839192e-15
KEGG_OXIDATIVE_PHOSPHORYLATION                                             3.491358e-15
REACTOME_ELECTRON_TRANSPORT_CHAIN                                          5.005995e-15
KEGG_ALZHEIMERS_DISEASE                                                    3.306168e-12
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                           4.568293e-11
KEGG_HUNTINGTONS_DISEASE                                                   3.248647e-10
REACTOME_REGULATION_OF_INSULIN_SECRETION                                   1.611900e-09
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                              2.700881e-09
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                         1.052427e-08
KEGG_PROTEASOME                                                            1.058222e-08
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                               3.144120e-08
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                             3.740179e-08
REACTOME_STABILIZATION_OF_P53                                              6.522039e-08
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                7.748487e-08
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                        8.120236e-08
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                  9.625664e-08
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                          1.853028e-07
REACTOME_SIGNALING_BY_WNT                                                  1.932560e-07
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC              3.724800e-07
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                4.225662e-07
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                 5.652607e-07
REACTOME_GENE_EXPRESSION                                                   6.425514e-07
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                               7.186981e-07
REACTOME_M_G1_TRANSITION                                                   1.059953e-06
REACTOME_G1_S_TRANSITION                                                   1.682678e-06
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE      1.857432e-06
KEGG_PROTEIN_EXPORT                                                        2.806326e-06
BIOCARTA_ETC_PATHWAY                                                       3.925179e-06
REACTOME_DNA_REPLICATION_PRE_INITIATION                                    4.241129e-06
BIOCARTA_RAB_PATHWAY                                                       7.089697e-06
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                       7.910600e-06
REACTOME_METABOLISM_OF_AMINO_ACIDS                                         1.757114e-05
REACTOME_CELL_CYCLE_CHECKPOINTS                                            1.821139e-05
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC            2.301368e-05
REACTOME_S_PHASE                                                           2.778510e-05
REACTOME_HIV_INFECTION                                                     3.533418e-05
REACTOME_SYNTHESIS_OF_DNA                                                  5.736754e-05
BIOCARTA_PROTEASOME_PATHWAY                                                1.466427e-04
KEGG_CARDIAC_MUSCLE_CONTRACTION                                            1.928235e-04
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                  2.272538e-04
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                       2.721551e-04
BIOCARTA_SALMONELLA_PATHWAY                                                2.757882e-04
BIOCARTA_CDC42RAC_PATHWAY                                                  3.143116e-04
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                          4.647822e-04
REACTOME_MITOTIC_M_M_G1_PHASES                                             4.659876e-04
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                     4.930346e-04
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                     4.994264e-04
KEGG_CITRATE_CYCLE_TCA_CYCLE                                               5.506234e-04
                                                                                    FDR
KEGG_RIBOSOME                                                              1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION                                          1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION                                            3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                          3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                 4.550585e-36
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                       1.458015e-33
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT           1.458015e-33
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                   1.061480e-31
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                               2.113196e-31
REACTOME_TRANSLATION                                                       9.756803e-31
REACTOME_INFLUENZA_LIFE_CYCLE                                              2.297134e-20
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 3.743573e-20
REACTOME_DIABETES_PATHWAYS                                                 5.570160e-20
REACTOME_METABOLISM_OF_PROTEINS                                            3.183007e-18
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                          5.217609e-18
KEGG_PARKINSONS_DISEASE                                                    9.575296e-14
KEGG_OXIDATIVE_PHOSPHORYLATION                                             1.710766e-13
REACTOME_ELECTRON_TRANSPORT_CHAIN                                          2.316663e-13
KEGG_ALZHEIMERS_DISEASE                                                    1.449494e-10
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                           1.902694e-09
KEGG_HUNTINGTONS_DISEASE                                                   1.288630e-08
REACTOME_REGULATION_OF_INSULIN_SECRETION                                   6.103241e-08
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                              9.781886e-08
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                         3.525994e-07
KEGG_PROTEASOME                                                            3.525994e-07
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                               1.007328e-06
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                             1.153915e-06
REACTOME_STABILIZATION_OF_P53                                              1.940307e-06
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                2.225686e-06
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                        2.254719e-06
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                  2.586509e-06
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                          4.823664e-06
REACTOME_SIGNALING_BY_WNT                                                  4.878251e-06
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC              9.125760e-06
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                1.005707e-05
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                 1.307950e-05
REACTOME_GENE_EXPRESSION                                                   1.446609e-05
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                               1.575462e-05
REACTOME_M_G1_TRANSITION                                                   2.263951e-05
REACTOME_G1_S_TRANSITION                                                   3.504176e-05
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE      3.773759e-05
KEGG_PROTEIN_EXPORT                                                        5.565880e-05
BIOCARTA_ETC_PATHWAY                                                       7.603893e-05
REACTOME_DNA_REPLICATION_PRE_INITIATION                                    8.029228e-05
BIOCARTA_RAB_PATHWAY                                                       1.312382e-04
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                       1.432507e-04
REACTOME_METABOLISM_OF_AMINO_ACIDS                                         3.114204e-04
REACTOME_CELL_CYCLE_CHECKPOINTS                                            3.160435e-04
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC            3.912325e-04
REACTOME_S_PHASE                                                           4.628998e-04
REACTOME_HIV_INFECTION                                                     5.771249e-04
REACTOME_SYNTHESIS_OF_DNA                                                  9.189839e-04
BIOCARTA_PROTEASOME_PATHWAY                                                2.304781e-03
KEGG_CARDIAC_MUSCLE_CONTRACTION                                            2.974481e-03
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                  3.441862e-03
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                       4.030378e-03
BIOCARTA_SALMONELLA_PATHWAY                                                4.030378e-03
BIOCARTA_CDC42RAC_PATHWAY                                                  4.514165e-03
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                          6.469461e-03
REACTOME_MITOTIC_M_M_G1_PHASES                                             6.469461e-03
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                     6.710035e-03
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                     6.710035e-03
KEGG_CITRATE_CYCLE_TCA_CYCLE                                               7.280464e-03

We create a histogram of the p-values from the Camera results to visualize the distribution of significance levels across all tested pathways. This plot provides an overview of the significance of the pathways in relation to dysplasia.

hist(camerares_smoker$PValue, xlab="P-value", main="Camera", las=1)

4.2.2 GSEA for Dysplasia (PML)

We do the same steps as previously but now for the Dysplasia. Load the required libraries and data.

library(fgsea)
library(GSVAdata)
library(GSEABase)
data(c2BroadSets)

c2BroadSets <- c2BroadSets[c(grep("^KEGG", names(c2BroadSets)),
  +                              grep("^REACTOME", names(c2BroadSets)),
  +                              grep("^BIOCARTA", names(c2BroadSets)))]
c2BroadSets
GeneSetCollection
  names: KEGG_GLYCOLYSIS_GLUCONEOGENESIS, KEGG_CITRATE_CYCLE_TCA_CYCLE, ..., BIOCARTA_ACTINY_PATHWAY (833 total)
  unique identifiers: 55902, 2645, ..., 8544 (6744 total)
  types in collection:
    geneIdType: EntrezIdentifier (1 total)
    collectionType: BroadCollection (1 total)

Create a costum gene set for Dysplasia.


library(GSEABase)
control_genes <- as.character(unique(control$entrez_gene))
control_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=control_genes, setName="CONTROL")
details(control_GS)
setName: CONTROL 
geneIds: 55289, 114, ..., 7364 (total: 70)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null 
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:33 2024:3
description: 
organism: 
pubMedIds: 
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

tumor_genes <- as.character(unique(tumor$entrez_gene))
tumor_GS <- GeneSet(EntrezIdentifier("org.Hs.eg.db"), geneIds=tumor_genes, setName="TUMOR")
details(tumor_GS)
setName: TUMOR 
geneIds: 91, 324, ..., 7157 (total: 12)
geneIdType: EntrezId (org.Hs.eg.db)
collectionType: Null 
setIdentifier: Emmas-Laptop.local:51879:Tue Aug 20 19:35:33 2024:4
description: 
organism: 
pubMedIds: 
urls: 
contributor: 
setVersion: 0.0.1
creationDate: 

gsc_dysplasia <- GeneSetCollection(c(c2BroadSets, control_GS, tumor_GS))
gsets_dysplasia<- geneIds(gsc_dysplasia)

  

Prepare the data.


histology_factors_dysplasia <- factor(se.filt$dysplasia)
histology_integers_dysplasia <- as.integer(histology_factors_dysplasia)

levels(histology_factors_dysplasia)
[1] "dysplasia"    "no dysplasia"
histology_integers_dysplasia
 [1] 1 1 1 1 1 1 1 1 1 1 1 2 2 1 1 2 1 2 2 2 2 1 1 2 2 2 2 1 1 1 1 2 1 1 2 2 2 1
[39] 1 1 1 2 2 1 1 2 2 1 2 1 1 1 2 2 1 1 1 1 2 1 1 2 1 1 2 1 1 1 1 1 1 1 1 1

Run the GSEA by permuting phenotypes.


gseapermpheno_dysplasia <- fgseaLabel(gsets_dysplasia, assays(se.filt)$logCPM, histology_integers_dysplasia,10000, minSize=5, maxSize=300)
head(gseapermpheno_dysplasia[order(gseapermpheno_dysplasia$pval)])
                                                                      pathway
                                                                       <char>
1:                         REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
2:                                                    KEGG_PARKINSONS_DISEASE
3:                                                    KEGG_ALZHEIMERS_DISEASE
4:                                   REACTOME_REGULATION_OF_INSULIN_SECRETION
5:                                                   KEGG_HUNTINGTONS_DISEASE
6: REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
           pval      padj         ES       NES nMoreExtreme  size  leadingEdge
          <num>     <num>      <num>     <num>        <int> <int>       <list>
1: 0.0001966955 0.1599135 -0.9322777 -1.778352            0    15 506, 106....
2: 0.0005896226 0.1607991 -0.7284616 -1.862856            2   111 506, 471....
3: 0.0007889546 0.1607991 -0.6349335 -1.910018            3   139 506, 471....
4: 0.0007911392 0.1607991 -0.5998803 -1.829791            3   171 506, 471....
5: 0.0013853157 0.1766344 -0.6018486 -1.879833            6   157 506, 471....
6: 0.0015822785 0.1766344 -0.8644775 -1.788427            7    44 6217, 21....

Run GSEA with the ranked gene list.

stats_dysplasia <- tt_dysplasia$t
names(stats_dysplasia) <- rownames(tt_dysplasia)
gseapermgsets_dysplasia <- fgsea(gsets_dysplasia, stats_dysplasia, minSize=5, maxSize=300)
Warning in fgseaMultilevel(pathways = pathways, stats = stats, minSize =
minSize, : For some pathways, in reality P-values are less than 1e-50. You can
set the `eps` argument to zero for better estimation.
head(gseapermgsets_dysplasia[order(gseapermgsets_dysplasia$pval), ])
                                                            pathway  pval
                                                             <char> <num>
1:                                                    KEGG_RIBOSOME 1e-50
2:                REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1e-50
3: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1e-50
4:       REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 1e-50
5:                         REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1e-50
6:                                REACTOME_PEPTIDE_CHAIN_ELONGATION 1e-50
       padj log2err         ES       NES  size  leadingEdge
      <num>   <num>      <num>     <num> <int>       <list>
1: 8.13e-49      NA -0.9313928 -4.274983    80 6170, 47....
2: 8.13e-49      NA -0.9007982 -4.235589    88 6170, 47....
3: 8.13e-49      NA -0.8774321 -4.197048    99 6170, 47....
4: 8.13e-49      NA -0.8692742 -4.102250    92 6170, 47....
5: 8.13e-49      NA -0.8104961 -3.979237   114 6170, 47....
6: 8.13e-49      NA -0.9278116 -4.237599    78 6170, 47....
gseapermpheno_dysplasia[gseapermpheno_dysplasia$padj < 0.01, ]
Empty data.table (0 rows and 8 cols): pathway,pval,padj,ES,NES,nMoreExtreme...
gseapermgsets_dysplasia[gseapermgsets_dysplasia$padj < 0.01, ]
                                                                       pathway
                                                                        <char>
 1:                                                    BIOCARTA_ACTINY_PATHWAY
 2:                                                  BIOCARTA_CDC42RAC_PATHWAY
 3:                                                   BIOCARTA_CHREBP2_PATHWAY
 4:                                                       BIOCARTA_ETC_PATHWAY
 5:                                                       BIOCARTA_MPR_PATHWAY
 6:                                                BIOCARTA_PROTEASOME_PATHWAY
 7:                                                       BIOCARTA_RAB_PATHWAY
 8:                                                BIOCARTA_SALMONELLA_PATHWAY
 9:                                                      BIOCARTA_SARS_PATHWAY
10:                                                    KEGG_ALZHEIMERS_DISEASE
11:                                     KEGG_AMYOTROPHIC_LATERAL_SCLEROSIS_ALS
12:                                       KEGG_ARGININE_AND_PROLINE_METABOLISM
13:                                            KEGG_CARDIAC_MUSCLE_CONTRACTION
14:                                               KEGG_CITRATE_CYCLE_TCA_CYCLE
15:                                                   KEGG_HUNTINGTONS_DISEASE
16:                                                 KEGG_N_GLYCAN_BIOSYNTHESIS
17:                                             KEGG_OXIDATIVE_PHOSPHORYLATION
18:                                                    KEGG_PARKINSONS_DISEASE
19:                                 KEGG_PATHOGENIC_ESCHERICHIA_COLI_INFECTION
20:                                                            KEGG_PROTEASOME
21:                                                        KEGG_PROTEIN_EXPORT
22:                                                     KEGG_PURINE_METABOLISM
23:                                                              KEGG_RIBOSOME
24:                                                           KEGG_SPLICEOSOME
25:                                                  KEGG_STEROID_BIOSYNTHESIS
26:                         REACTOME_ADP_SIGNALLING_THROUGH_P2Y_PURINOCEPTOR_1
27:                                                         REACTOME_APOPTOSIS
28:                               REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC
29:                REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A
30:                 REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX
31:                                            REACTOME_CELL_CYCLE_CHECKPOINTS
32:                                                REACTOME_CELL_CYCLE_MITOTIC
33:                                          REACTOME_CHOLESTEROL_BIOSYNTHESIS
34:                REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_
35:                                     REACTOME_CYTOSOLIC_TRNA_AMINOACYLATION
36:                                                    REACTOME_DARPP32_EVENTS
37:                                    REACTOME_DNA_REPLICATION_PRE_INITIATION
38:                                  REACTOME_DUAL_INCISION_REACTION_IN_TC_NER
39:                                          REACTOME_ELECTRON_TRANSPORT_CHAIN
40:                   REACTOME_ELONGATION_AND_PROCESSING_OF_CAPPED_TRANSCRIPTS
41:                       REACTOME_FORMATION_AND_MATURATION_OF_MRNA_TRANSCRIPT
42:                         REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING
43:                          REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS
44:                         REACTOME_FORMATION_OF_THE_EARLY_ELONGATION_COMPLEX
45: REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX
46:            REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC
47:                                                   REACTOME_G1_S_TRANSITION
48:                           REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION
49:           REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT
50:                                     REACTOME_HIV1_TRANSCRIPTION_ELONGATION
51:                                     REACTOME_HIV1_TRANSCRIPTION_INITIATION
52:                                                     REACTOME_HIV_INFECTION
53:                                                    REACTOME_HIV_LIFE_CYCLE
54:                                  REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS
55:                                              REACTOME_INFLUENZA_LIFE_CYCLE
56:                 REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION
57:                                   REACTOME_INSULIN_SYNTHESIS_AND_SECRETION
58:                                  REACTOME_INTEGRATION_OF_ENERGY_METABOLISM
59:                                              REACTOME_MEMBRANE_TRAFFICKING
60:                                         REACTOME_METABOLISM_OF_AMINO_ACIDS
61:                                            REACTOME_METABOLISM_OF_PROTEINS
62:                                             REACTOME_MITOTIC_M_M_G1_PHASES
63:                                                     REACTOME_MRNA_SPLICING
64:                                       REACTOME_MRNA_SPLICING_MINOR_PATHWAY
65:                                                   REACTOME_M_G1_TRANSITION
66:                                       REACTOME_ORC1_REMOVAL_FROM_CHROMATIN
67:                               REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE
68:                                          REACTOME_PEPTIDE_CHAIN_ELONGATION
69:              REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC
70:                   REACTOME_PROCESSING_OF_CAPPED_INTRON_CONTAINING_PRE_MRNA
71:                                 REACTOME_PYRUVATE_METABOLISM_AND_TCA_CYCLE
72:      REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE
73:                               REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT
74:                       REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS
75:                                   REACTOME_REGULATION_OF_INSULIN_SECRETION
76:                             REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE
77:                        REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1
78:                          REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21
79:                                                  REACTOME_SIGNALING_BY_WNT
80:                                              REACTOME_STABILIZATION_OF_P53
81:                                                  REACTOME_SYNTHESIS_OF_DNA
82:                                                           REACTOME_S_PHASE
83:                                         REACTOME_TRANSCRIPTION_COUPLED_NER
84:                                   REACTOME_TRANSCRIPTION_OF_THE_HIV_GENOME
85:                                                       REACTOME_TRANSLATION
86:                          REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION
87:                              REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G
88:                                     REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS
89:                                            REACTOME_VIRAL_MRNA_TRANSLATION
                                                                       pathway
            pval         padj   log2err         ES       NES  size  leadingEdge
           <num>        <num>     <num>      <num>     <num> <int>       <list>
 1: 3.336223e-04 3.562039e-03 0.4984931 -0.6691403 -2.081378    16 10094, 5....
 2: 6.248539e-05 8.063591e-04 0.5384341 -0.7276557 -2.178454    14 10094, 3....
 3: 2.043628e-04 2.373528e-03 0.5188481 -0.5196364 -1.979624    35 10971, 5....
 4: 1.272674e-06 1.989777e-05 0.6435518 -0.8327598 -2.400357    12 54205, 6....
 5: 7.092562e-04 6.851226e-03 0.4772708 -0.5515447 -1.959232    26 10094, 1....
 6: 1.173923e-04 1.424476e-03 0.5384341 -0.6608645 -2.138371    19 5689, 61....
 7: 4.558323e-07 7.563095e-06 0.6749629 -0.8874452 -2.348127    10 5868, 58....
 8: 3.066344e-04 3.368834e-03 0.4984931 -0.7798890 -2.063540    10 10094, 5....
 9: 6.315221e-04 6.185873e-03 0.4772708 -0.7981492 -1.954955     8 2091, 19....
10: 4.098740e-26 1.851264e-24 1.3267161 -0.6401202 -3.256527   139 506, 542....
11: 3.352730e-04 3.562039e-03 0.4984931 -0.4998801 -1.950632    39 54205, 5....
12: 1.601275e-04 1.886719e-03 0.5188481 -0.5216940 -2.024517    38 4942, 44....
13: 1.271430e-08 3.040214e-07 0.7477397 -0.6234644 -2.556679    46 1329, 73....
14: 1.454147e-05 2.149494e-04 0.5933255 -0.6251585 -2.224068    27 8802, 63....
15: 1.447183e-25 6.192421e-24 1.3110148 -0.6075394 -3.135642   157 506, 542....
16: 3.736896e-05 5.149316e-04 0.5573322 -0.5175880 -2.089971    44 6184, 88....
17: 5.011556e-35 3.134150e-33 1.5364590 -0.7326132 -3.596861   114 506, 471....
18: 4.738118e-33 2.751493e-31 1.4954793 -0.7322581 -3.592772   111 506, 542....
19: 3.373641e-04 3.562039e-03 0.4984931 -0.4883693 -1.951885    42 10971, 1....
20: 1.173811e-08 2.891845e-07 0.7477397 -0.6432409 -2.555135    41 5689, 56....
21: 1.457398e-04 1.742448e-03 0.5188481 -0.6288415 -2.125164    23 6731, 10....
22: 9.254324e-05 1.157502e-03 0.5384341 -0.3677763 -1.815030   120 5634, 54....
23: 1.000000e-50 8.130000e-49        NA -0.9313928 -4.274983    80 6170, 47....
24: 2.339411e-04 2.641585e-03 0.5188481 -0.3487875 -1.737921   124 6632, 66....
25: 5.073252e-04 5.092042e-03 0.4772708 -0.6809386 -2.038593    14 6309, 22....
26: 9.039515e-04 8.447271e-03 0.4772708 -0.6442683 -1.930484    15 2769, 55....
27: 2.148243e-05 3.064073e-04 0.5756103 -0.3868708 -1.900958   113 54205, 5....
28: 1.124826e-07 2.078372e-06 0.7049757 -0.5586646 -2.404575    56 7324, 56....
29: 1.032019e-07 1.951236e-06 0.7049757 -0.5520804 -2.420206    61 7324, 56....
30: 9.470563e-08 1.833230e-06 0.7049757 -0.5797406 -2.441094    50 5689, 57....
31: 8.910281e-09 2.336793e-07 0.7477397 -0.4892937 -2.330576    97 7324, 56....
32: 5.560917e-04 5.513446e-03 0.4772708 -0.2775420 -1.544275   268 6232, 73....
33: 9.957026e-05 1.226525e-03 0.5384341 -0.6654635 -2.153252    19 6309, 10....
34: 4.637012e-09 1.396256e-07 0.7614608 -0.5979572 -2.560842    55 5689, 57....
35: 5.052906e-04 5.092042e-03 0.4772708 -0.5992474 -2.025151    23 9255, 16....
36: 9.021179e-04 8.447271e-03 0.4772708 -0.5545511 -1.913469    24 5515, 10....
37: 1.793391e-08 3.940613e-07 0.7337620 -0.5346576 -2.391126    68 5689, 57....
38: 2.940978e-05 4.122439e-04 0.5756103 -0.6277078 -2.203301    25 5440, 54....
39: 4.267074e-27 2.040665e-25 1.3499257 -0.7782906 -3.505758    71 54205, 4....
40: 1.046205e-05 1.575119e-04 0.5933255 -0.3753481 -1.885063   130 6632, 66....
41: 8.571122e-06 1.314778e-04 0.5933255 -0.3650816 -1.876195   148 2958, 66....
42: 1.060214e-13 3.747626e-12 0.9545416 -0.9350571 -2.801803    15 506, 106....
43: 1.000000e-50 8.130000e-49        NA -0.9007982 -4.235589    88 6170, 47....
44: 4.751831e-04 4.890175e-03 0.4984931 -0.5513123 -1.984757    29 5440, 29....
45: 2.451075e-25 9.489161e-24 1.3030932 -0.8642115 -3.489603    44 6217, 21....
46: 3.835032e-05 5.196469e-04 0.5573322 -0.7199355 -2.157213    15 10694, 9....
47: 5.428916e-09 1.576325e-07 0.7614608 -0.5161450 -2.426935    88 5689, 57....
48: 3.582757e-29 1.941854e-27 1.4025887 -0.6559124 -3.335768   140 506, 542....
49: 1.000000e-50 8.130000e-49        NA -0.8774321 -4.197048    99 6170, 47....
50: 2.455509e-04 2.734697e-03 0.4984931 -0.5118734 -1.986407    38 6921, 54....
51: 4.794440e-05 6.286903e-04 0.5573322 -0.5524893 -2.106255    36 2958, 68....
52: 1.321673e-11 4.477169e-10 0.8753251 -0.4451817 -2.334587   176 2958, 54....
53: 2.115669e-04 2.422590e-03 0.5188481 -0.3728299 -1.789787   100 2958, 54....
54: 4.136405e-07 7.006037e-06 0.6749629 -0.4320191 -2.126484   116 5478, 49....
55: 1.957664e-44 1.446892e-42 1.7387813 -0.7574919 -3.793314   129 6170, 47....
56: 1.000000e-50 8.130000e-49        NA -0.8692742 -4.102250    92 6170, 47....
57: 1.000000e-50 8.130000e-49        NA -0.8104961 -3.979237   114 6170, 47....
58: 1.541289e-25 6.265340e-24 1.3110148 -0.5696555 -3.029977   191 506, 542....
59: 9.228149e-04 8.525551e-03 0.4772708 -0.3777424 -1.724934    76 8673, 11....
60: 5.950260e-09 1.668125e-07 0.7614608 -0.4594739 -2.295736   125 5689, 57....
61: 3.367637e-43 2.281574e-41 1.7147970 -0.6700502 -3.579434   194 6170, 47....
62: 1.612246e-07 2.912791e-06 0.6901325 -0.4180823 -2.128723   137 6232, 56....
63: 1.570493e-05 2.280020e-04 0.5756103 -0.4016226 -1.930960   104 6632, 66....
64: 4.832716e-08 1.007435e-06 0.7195128 -0.6304098 -2.492947    40 6632, 66....
65: 1.170229e-08 2.891845e-07 0.7477397 -0.5764672 -2.483526    57 5689, 57....
66: 2.234667e-07 3.949531e-06 0.6901325 -0.5415920 -2.374227    61 5689, 57....
67: 6.753795e-09 1.830279e-07 0.7614608 -0.6371275 -2.546433    42 5689, 57....
68: 1.000000e-50 8.130000e-49        NA -0.9278116 -4.237599    78 6170, 47....
69: 9.781723e-07 1.559322e-05 0.6435518 -0.7222901 -2.410892    21 7411, 10....
70: 9.217399e-05 1.157502e-03 0.5384341 -0.3543974 -1.801202   135 6632, 66....
71: 9.655577e-04 8.820207e-03 0.4772708 -0.4813570 -1.833794    35 8802, 63....
72: 2.402736e-07 4.156222e-06 0.6749629 -0.5187248 -2.324521    69 7324, 56....
73: 1.000000e-50 8.130000e-49        NA -0.9080956 -4.271741    87 6170, 47....
74: 1.000000e-50 8.130000e-49        NA -0.9158695 -4.226378    83 6170, 47....
75: 9.280765e-28 4.715789e-26 1.3651796 -0.6082146 -3.174860   171 506, 542....
76: 1.353900e-08 3.144917e-07 0.7477397 -0.6232034 -2.555609    46 5689, 57....
77: 7.437584e-08 1.474818e-06 0.7049757 -0.5958419 -2.473595    48 5689, 57....
78: 2.233452e-08 4.778411e-07 0.7337620 -0.5960809 -2.509898    50 5689, 57....
79: 4.460310e-10 1.450493e-08 0.8140358 -0.6088265 -2.644210    58 5689, 57....
80: 1.616141e-08 3.649784e-07 0.7337620 -0.6217982 -2.549847    46 5689, 57....
81: 9.146997e-07 1.487302e-05 0.6594444 -0.4751757 -2.190291    82 2237, 56....
82: 7.351977e-08 1.474818e-06 0.7049757 -0.4790759 -2.267473    93 2237, 56....
83: 4.635933e-04 4.832068e-03 0.4984931 -0.4834178 -1.911668    40 5440, 51....
84: 4.023557e-05 5.362544e-04 0.5573322 -0.4756486 -2.047261    56 2958, 68....
85: 1.000000e-50 8.130000e-49        NA -0.8250959 -4.065058   112 6170, 47....
86: 3.759356e-24 1.389253e-22 1.2709130 -0.8207037 -3.471127    51 6217, 21....
87: 1.215655e-09 3.801258e-08 0.7881868 -0.6445795 -2.655455    47 6921, 56....
88: 7.163028e-04 6.851226e-03 0.4772708 -0.6992133 -2.015421    12 5440, 29....
89: 1.000000e-50 8.130000e-49        NA -0.9222092 -4.212012    78 6170, 47....
            pval         padj   log2err         ES       NES  size  leadingEdge
significant_pathways_gsets_dysplasia <- gseapermgsets_dysplasia[gseapermgsets_dysplasia$padj < 0.01, ]
significant_pathways_gsets_ordered_dysplasia <- significant_pathways_gsets_dysplasia[order(significant_pathways_gsets_dysplasia$padj), ]
head(significant_pathways_gsets_ordered_dysplasia, n = 10)
                                                             pathway  pval
                                                              <char> <num>
 1:                                                    KEGG_RIBOSOME 1e-50
 2:                REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS 1e-50
 3: REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT 1e-50
 4:       REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 1e-50
 5:                         REACTOME_INSULIN_SYNTHESIS_AND_SECRETION 1e-50
 6:                                REACTOME_PEPTIDE_CHAIN_ELONGATION 1e-50
 7:                     REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT 1e-50
 8:             REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS 1e-50
 9:                                             REACTOME_TRANSLATION 1e-50
10:                                  REACTOME_VIRAL_MRNA_TRANSLATION 1e-50
        padj log2err         ES       NES  size  leadingEdge
       <num>   <num>      <num>     <num> <int>       <list>
 1: 8.13e-49      NA -0.9313928 -4.274983    80 6170, 47....
 2: 8.13e-49      NA -0.9007982 -4.235589    88 6170, 47....
 3: 8.13e-49      NA -0.8774321 -4.197048    99 6170, 47....
 4: 8.13e-49      NA -0.8692742 -4.102250    92 6170, 47....
 5: 8.13e-49      NA -0.8104961 -3.979237   114 6170, 47....
 6: 8.13e-49      NA -0.9278116 -4.237599    78 6170, 47....
 7: 8.13e-49      NA -0.9080956 -4.271741    87 6170, 47....
 8: 8.13e-49      NA -0.9158695 -4.226378    83 6170, 47....
 9: 8.13e-49      NA -0.8250959 -4.065058   112 6170, 47....
10: 8.13e-49      NA -0.9222092 -4.212012    78 6170, 47....

Create histograms of p-values and generate an enrichment plot.

par(mfrow=c(1, 2))
hist(gseapermpheno_dysplasia$pval, xlab="P-value", main="Permuting phenotypes", las=1)
hist(gseapermgsets_dysplasia$pval, xlab="P-value", main="Permuting gene sets", las=1)

plotEnrichment(gsets_dysplasia$CONTROL, stats_dysplasia)

plotEnrichment(gsets_dysplasia$TUMOR, stats_dysplasia)

Identify significant pathways.

DEpwys_dysplasia <- significant_pathways_gsets_ordered_dysplasia$pathway
plotGseaTable(gsets_dysplasia[DEpwys_dysplasia], stats_dysplasia, gseapermgsets_dysplasia, gseaParam=0.5)

We generate an HTML page so we can visualize more easily.

library(kableExtra)
gsea_tab <- kable(significant_pathways_gsets_ordered_dysplasia, "html", caption="GSEA Results Dysplasia")
gsea_tab <- kable_styling(gsea_tab, bootstrap_options=c("stripped", "hover", "responsive"), fixed_thead=TRUE)
save_kable(gsea_tab, file="./results/gsea_results_dysplasia.html", self_contained=TRUE)

See the HTML page by clicking on this link.

Additional validation using Camera.

library(limma)
gsetidx_dysplasia <- ids2indices(geneIds(gsc_dysplasia), rownames(se.filt))
camerares_dysplasia <- camera(y=assays(se.filt)$logCPM, index=gsetidx_dysplasia, design=mod, contrast=2)
head(camerares_dysplasia, 5)
                                                           NGenes Direction
KEGG_RIBOSOME                                                  80      Down
REACTOME_PEPTIDE_CHAIN_ELONGATION                              78      Down
REACTOME_VIRAL_MRNA_TRANSLATION                                78      Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS              88      Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION     92      Down
                                                                 PValue
KEGG_RIBOSOME                                              2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION                          3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION                            1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS          1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 2.731443e-38
                                                                    FDR
KEGG_RIBOSOME                                              1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION                          1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION                            3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS          3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION 4.550585e-36

Filter significant pathways from Camera.

camerares_dysplasia[camerares_dysplasia$FDR < 0.01, ]
                                                                           NGenes
KEGG_RIBOSOME                                                                  80
REACTOME_PEPTIDE_CHAIN_ELONGATION                                              78
REACTOME_VIRAL_MRNA_TRANSLATION                                                78
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                              88
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                     92
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                           83
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT               99
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                      114
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                                   87
REACTOME_TRANSLATION                                                          112
REACTOME_INFLUENZA_LIFE_CYCLE                                                 129
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX     44
REACTOME_DIABETES_PATHWAYS                                                    312
REACTOME_METABOLISM_OF_PROTEINS                                               194
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                              51
KEGG_PARKINSONS_DISEASE                                                       111
KEGG_OXIDATIVE_PHOSPHORYLATION                                                114
REACTOME_ELECTRON_TRANSPORT_CHAIN                                              71
KEGG_ALZHEIMERS_DISEASE                                                       139
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                              140
KEGG_HUNTINGTONS_DISEASE                                                      157
REACTOME_REGULATION_OF_INSULIN_SECRETION                                      171
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                                  47
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                             15
KEGG_PROTEASOME                                                                41
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                                   42
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                                 46
REACTOME_STABILIZATION_OF_P53                                                  46
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                    55
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                            48
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                     191
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                              50
REACTOME_SIGNALING_BY_WNT                                                      58
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC                  21
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                    61
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                     50
REACTOME_GENE_EXPRESSION                                                      397
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                                   56
REACTOME_M_G1_TRANSITION                                                       57
REACTOME_G1_S_TRANSITION                                                       88
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE          69
KEGG_PROTEIN_EXPORT                                                            23
BIOCARTA_ETC_PATHWAY                                                           12
REACTOME_DNA_REPLICATION_PRE_INITIATION                                        68
BIOCARTA_RAB_PATHWAY                                                           10
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                           61
REACTOME_METABOLISM_OF_AMINO_ACIDS                                            125
REACTOME_CELL_CYCLE_CHECKPOINTS                                                97
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC                15
REACTOME_S_PHASE                                                               93
REACTOME_HIV_INFECTION                                                        176
REACTOME_SYNTHESIS_OF_DNA                                                      82
BIOCARTA_PROTEASOME_PATHWAY                                                    19
KEGG_CARDIAC_MUSCLE_CONTRACTION                                                46
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                     116
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                           40
BIOCARTA_SALMONELLA_PATHWAY                                                    10
BIOCARTA_CDC42RAC_PATHWAY                                                      14
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                              19
REACTOME_MITOTIC_M_M_G1_PHASES                                                137
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                         12
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                         36
KEGG_CITRATE_CYCLE_TCA_CYCLE                                                   27
                                                                           Direction
KEGG_RIBOSOME                                                                   Down
REACTOME_PEPTIDE_CHAIN_ELONGATION                                               Down
REACTOME_VIRAL_MRNA_TRANSLATION                                                 Down
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                               Down
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                      Down
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                            Down
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT                Down
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                        Down
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                                    Down
REACTOME_TRANSLATION                                                            Down
REACTOME_INFLUENZA_LIFE_CYCLE                                                   Down
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX      Down
REACTOME_DIABETES_PATHWAYS                                                      Down
REACTOME_METABOLISM_OF_PROTEINS                                                 Down
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                               Down
KEGG_PARKINSONS_DISEASE                                                         Down
KEGG_OXIDATIVE_PHOSPHORYLATION                                                  Down
REACTOME_ELECTRON_TRANSPORT_CHAIN                                               Down
KEGG_ALZHEIMERS_DISEASE                                                         Down
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                                Down
KEGG_HUNTINGTONS_DISEASE                                                        Down
REACTOME_REGULATION_OF_INSULIN_SECRETION                                        Down
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                                   Down
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                              Down
KEGG_PROTEASOME                                                                 Down
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                                    Down
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                                  Down
REACTOME_STABILIZATION_OF_P53                                                   Down
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                     Down
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                             Down
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                       Down
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                               Down
REACTOME_SIGNALING_BY_WNT                                                       Down
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC                   Down
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                     Down
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                      Down
REACTOME_GENE_EXPRESSION                                                        Down
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                                    Down
REACTOME_M_G1_TRANSITION                                                        Down
REACTOME_G1_S_TRANSITION                                                        Down
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE           Down
KEGG_PROTEIN_EXPORT                                                             Down
BIOCARTA_ETC_PATHWAY                                                            Down
REACTOME_DNA_REPLICATION_PRE_INITIATION                                         Down
BIOCARTA_RAB_PATHWAY                                                            Down
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                            Down
REACTOME_METABOLISM_OF_AMINO_ACIDS                                              Down
REACTOME_CELL_CYCLE_CHECKPOINTS                                                 Down
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC                 Down
REACTOME_S_PHASE                                                                Down
REACTOME_HIV_INFECTION                                                          Down
REACTOME_SYNTHESIS_OF_DNA                                                       Down
BIOCARTA_PROTEASOME_PATHWAY                                                     Down
KEGG_CARDIAC_MUSCLE_CONTRACTION                                                 Down
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                       Down
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                            Down
BIOCARTA_SALMONELLA_PATHWAY                                                     Down
BIOCARTA_CDC42RAC_PATHWAY                                                       Down
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                               Down
REACTOME_MITOTIC_M_M_G1_PHASES                                                  Down
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                          Down
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                          Down
KEGG_CITRATE_CYCLE_TCA_CYCLE                                                    Down
                                                                                 PValue
KEGG_RIBOSOME                                                              2.850098e-41
REACTOME_PEPTIDE_CHAIN_ELONGATION                                          3.820858e-41
REACTOME_VIRAL_MRNA_TRANSLATION                                            1.309456e-40
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                          1.886464e-39
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                 2.731443e-38
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                       1.145772e-35
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT           1.225223e-35
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                   1.019429e-33
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                               2.283164e-33
REACTOME_TRANSLATION                                                       1.171285e-32
REACTOME_INFLUENZA_LIFE_CYCLE                                              3.033430e-22
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 5.392902e-22
REACTOME_DIABETES_PATHWAYS                                                 8.692926e-22
REACTOME_METABOLISM_OF_PROTEINS                                            5.349591e-20
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                          9.395454e-20
KEGG_PARKINSONS_DISEASE                                                    1.839192e-15
KEGG_OXIDATIVE_PHOSPHORYLATION                                             3.491358e-15
REACTOME_ELECTRON_TRANSPORT_CHAIN                                          5.005995e-15
KEGG_ALZHEIMERS_DISEASE                                                    3.306168e-12
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                           4.568293e-11
KEGG_HUNTINGTONS_DISEASE                                                   3.248647e-10
REACTOME_REGULATION_OF_INSULIN_SECRETION                                   1.611900e-09
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                              2.700881e-09
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                         1.052427e-08
KEGG_PROTEASOME                                                            1.058222e-08
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                               3.144120e-08
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                             3.740179e-08
REACTOME_STABILIZATION_OF_P53                                              6.522039e-08
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                7.748487e-08
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                        8.120236e-08
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                  9.625664e-08
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                          1.853028e-07
REACTOME_SIGNALING_BY_WNT                                                  1.932560e-07
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC              3.724800e-07
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                4.225662e-07
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                 5.652607e-07
REACTOME_GENE_EXPRESSION                                                   6.425514e-07
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                               7.186981e-07
REACTOME_M_G1_TRANSITION                                                   1.059953e-06
REACTOME_G1_S_TRANSITION                                                   1.682678e-06
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE      1.857432e-06
KEGG_PROTEIN_EXPORT                                                        2.806326e-06
BIOCARTA_ETC_PATHWAY                                                       3.925179e-06
REACTOME_DNA_REPLICATION_PRE_INITIATION                                    4.241129e-06
BIOCARTA_RAB_PATHWAY                                                       7.089697e-06
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                       7.910600e-06
REACTOME_METABOLISM_OF_AMINO_ACIDS                                         1.757114e-05
REACTOME_CELL_CYCLE_CHECKPOINTS                                            1.821139e-05
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC            2.301368e-05
REACTOME_S_PHASE                                                           2.778510e-05
REACTOME_HIV_INFECTION                                                     3.533418e-05
REACTOME_SYNTHESIS_OF_DNA                                                  5.736754e-05
BIOCARTA_PROTEASOME_PATHWAY                                                1.466427e-04
KEGG_CARDIAC_MUSCLE_CONTRACTION                                            1.928235e-04
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                  2.272538e-04
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                       2.721551e-04
BIOCARTA_SALMONELLA_PATHWAY                                                2.757882e-04
BIOCARTA_CDC42RAC_PATHWAY                                                  3.143116e-04
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                          4.647822e-04
REACTOME_MITOTIC_M_M_G1_PHASES                                             4.659876e-04
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                     4.930346e-04
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                     4.994264e-04
KEGG_CITRATE_CYCLE_TCA_CYCLE                                               5.506234e-04
                                                                                    FDR
KEGG_RIBOSOME                                                              1.591387e-38
REACTOME_PEPTIDE_CHAIN_ELONGATION                                          1.591387e-38
REACTOME_VIRAL_MRNA_TRANSLATION                                            3.635921e-38
REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS                          3.928561e-37
REACTOME_INFLUENZA_VIRAL_RNA_TRANSCRIPTION_AND_REPLICATION                 4.550585e-36
REACTOME_REGULATION_OF_GENE_EXPRESSION_IN_BETA_CELLS                       1.458015e-33
REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT           1.458015e-33
REACTOME_INSULIN_SYNTHESIS_AND_SECRETION                                   1.061480e-31
REACTOME_REGULATION_OF_BETA_CELL_DEVELOPMENT                               2.113196e-31
REACTOME_TRANSLATION                                                       9.756803e-31
REACTOME_INFLUENZA_LIFE_CYCLE                                              2.297134e-20
REACTOME_FORMATION_OF_THE_TERNARY_COMPLEX_AND_SUBSEQUENTLY_THE_43S_COMPLEX 3.743573e-20
REACTOME_DIABETES_PATHWAYS                                                 5.570160e-20
REACTOME_METABOLISM_OF_PROTEINS                                            3.183007e-18
REACTOME_TRANSLATION_INITIATION_COMPLEX_FORMATION                          5.217609e-18
KEGG_PARKINSONS_DISEASE                                                    9.575296e-14
KEGG_OXIDATIVE_PHOSPHORYLATION                                             1.710766e-13
REACTOME_ELECTRON_TRANSPORT_CHAIN                                          2.316663e-13
KEGG_ALZHEIMERS_DISEASE                                                    1.449494e-10
REACTOME_GLUCOSE_REGULATION_OF_INSULIN_SECRETION                           1.902694e-09
KEGG_HUNTINGTONS_DISEASE                                                   1.288630e-08
REACTOME_REGULATION_OF_INSULIN_SECRETION                                   6.103241e-08
REACTOME_VIF_MEDIATED_DEGRADATION_OF_APOBEC3G                              9.781886e-08
REACTOME_FORMATION_OF_ATP_BY_CHEMIOSMOTIC_COUPLING                         3.525994e-07
KEGG_PROTEASOME                                                            3.525994e-07
REACTOME_P53_INDEPENDENT_DNA_DAMAGE_RESPONSE                               1.007328e-06
REACTOME_REGULATION_OF_ORNITHINE_DECARBOXYLASE                             1.153915e-06
REACTOME_STABILIZATION_OF_P53                                              1.940307e-06
REACTOME_CYCLIN_E_ASSOCIATED_EVENTS_DURING_G1_S_TRANSITION_                2.225686e-06
REACTOME_SCF_BETA_TRCP_MEDIATED_DEGRADATION_OF_EMI1                        2.254719e-06
REACTOME_INTEGRATION_OF_ENERGY_METABOLISM                                  2.586509e-06
REACTOME_SCF_SKP2_MEDIATED_DEGRADATION_OF_P27_P21                          4.823664e-06
REACTOME_SIGNALING_BY_WNT                                                  4.878251e-06
REACTOME_PREFOLDIN_MEDIATED_TRANSFER_OF_SUBSTRATE_TO_CCT_TRIC              9.125760e-06
REACTOME_CDC20_PHOSPHO_APC_MEDIATED_DEGRADATION_OF_CYCLIN_A                1.005707e-05
REACTOME_CDT1_ASSOCIATION_WITH_THE_CDC6_ORC_ORIGIN_COMPLEX                 1.307950e-05
REACTOME_GENE_EXPRESSION                                                   1.446609e-05
REACTOME_AUTODEGRADATION_OF_CDH1_BY_CDH1_APC                               1.575462e-05
REACTOME_M_G1_TRANSITION                                                   2.263951e-05
REACTOME_G1_S_TRANSITION                                                   3.504176e-05
REACTOME_REGULATION_OF_APC_ACTIVATORS_BETWEEN_G1_S_AND_EARLY_ANAPHASE      3.773759e-05
KEGG_PROTEIN_EXPORT                                                        5.565880e-05
BIOCARTA_ETC_PATHWAY                                                       7.603893e-05
REACTOME_DNA_REPLICATION_PRE_INITIATION                                    8.029228e-05
BIOCARTA_RAB_PATHWAY                                                       1.312382e-04
REACTOME_ORC1_REMOVAL_FROM_CHROMATIN                                       1.432507e-04
REACTOME_METABOLISM_OF_AMINO_ACIDS                                         3.114204e-04
REACTOME_CELL_CYCLE_CHECKPOINTS                                            3.160435e-04
REACTOME_FORMATION_OF_TUBULIN_FOLDING_INTERMEDIATES_BY_CCT_TRIC            3.912325e-04
REACTOME_S_PHASE                                                           4.628998e-04
REACTOME_HIV_INFECTION                                                     5.771249e-04
REACTOME_SYNTHESIS_OF_DNA                                                  9.189839e-04
BIOCARTA_PROTEASOME_PATHWAY                                                2.304781e-03
KEGG_CARDIAC_MUSCLE_CONTRACTION                                            2.974481e-03
REACTOME_HOST_INTERACTIONS_OF_HIV_FACTORS                                  3.441862e-03
REACTOME_MRNA_SPLICING_MINOR_PATHWAY                                       4.030378e-03
BIOCARTA_SALMONELLA_PATHWAY                                                4.030378e-03
BIOCARTA_CDC42RAC_PATHWAY                                                  4.514165e-03
REACTOME_CHOLESTEROL_BIOSYNTHESIS                                          6.469461e-03
REACTOME_MITOTIC_M_M_G1_PHASES                                             6.469461e-03
REACTOME_VIRAL_MESSENGER_RNA_SYNTHESIS                                     6.710035e-03
REACTOME_HIV1_TRANSCRIPTION_INITIATION                                     6.710035e-03
KEGG_CITRATE_CYCLE_TCA_CYCLE                                               7.280464e-03

Visualise significant pathways from Camera for Dysplasia.

hist(camerares_dysplasia$PValue, xlab="P-value", main="Camera", las=1)

5 Discussion

We ended up following two different pipelines and therefore reached two sets of differentially expressed genes and pathways. This proved to be fruitful seeing as both current vs. former smokers and dysplasia vs. no dysplasia have statistically significant differencial expression. The results of the differential expression were significant enough to perform functional analysis on and find out how the genome changes between current smokers vs former smokers, and how does it affect the state of their ariway tissue (dysplasia).

We decided on this aim for our analysis because of the results we got from doing Experimental Data Analysis on the dataset. We looked at all of the phenotypical and technical variables to find something that may be worth of interest or that might output statistical significant results. What we found during Experimental Dessign and batch identification is that our samples clustered together (with some exceptions) in two groups. After some trial and error, we found the reason for this clustering is the divide between the groups of samples that are current smokers vs. former smokers. We went ahead and decided to explore this path.

A second analysis seemed pertinent, on a different variable to be able to compare the two. Dysplasia vs. no dysplasia was an appropriate group, especially since it describes the state of damage of the tissue, which should be related to wether or not someone smokes.

As we suspected during EDA and Experimental Design, the results from our Differential Expression Analysis in Current smokers vs. Former smokers were of high statistical significance. We found a very high number of DE genes, some of the top ones being up until 26 times more upregulated in one group. Our distribution of p-values shows how most of them are very close to 0, and the volcano plot also helps confirm there is significant Differential Gene Expression between samples belonging to former smokers vs. samples belonging to current smokers.

In our analysis, we aimed to explore the effects of smoking on cellular functions using Gene Set Enrichment Analysis (GSEA) and Gene Ontology (GO) analysis. We conducted these analyses on samples with dysplasia versus no dysplasia and current smokers versus former smokers.

Both GSEA and GO analysis in our sets of DE genes (dysplasia vs no dysplasia and current vs former smoker) revealed that smoking significantly impacts various cellular pathways, particularly those related to ribosomal function, protein synthesis, mitochondrial function, and metabolic processes (REACTOME_TRANSLATION, REACTOME_GTP_HYDROLYSIS_AND_JOINING_OF_THE_60S_RIBOSOMAL_SUBUNIT and REACTOME_FORMATION_OF_A_POOL_OF_FREE_40S_SUBUNITS). This illustrates that chronic exposure to cigarette smoke can lead to oxidative stress, damaging cellular components and disrupting normal protein synthesis and function.

Smoking affects the expression and function of ribosomal proteins, which are critical for protein biogenesis (KEGG_RIBOSOME). Our findings align with the study by Park et al. (2021), which also identified a detrimental effect of smoking on ribosomal function and protein synthesis. Exposure to electronic cigarette smoke reduces ribosomal protein gene expression, impairing protein synthesis in primary human airway epithelial cells. The reduced expression of ribosomal proteins and subsequent impairment in protein synthesis could lead to various cellular dysfunctions and contribute to the pathogenesis of various smoking-related diseases, including cancer.

Further findings indicate that smoking disrupts cytoplasmic translation and peptide chain elongation (REACTOME_PEPTIDE_CHAIN_ELONGATION). The study by Merzah et al. (2023) supports these observations, revealing differentially expressed genes (DEGs) related to immune response and inflammation pathways in smokers. Smoking leads to significant downregulation of genes involved in oxygen binding and activity, emphasizing the role of oxidative stress in disrupting cellular functions, contributing to disease development.

Additionally, our analyses identified pathways related to mitochondrial electron transport and ATP synthesis (KEGG_OXIDATIVE_PHOSPHORYLATION). Smoking introduces numerous toxins that induce oxidative stress, resulting in the production of reactive oxygen species (ROS), which can damage cellular components, including mitochondria. This oxidative stress impairs mitochondrial function, leading to a cascade of detrimental effects on cellular health. The study by Kanithi et al. (2022) highlighted the impact of cigarette smoke and e-cigarette vapor on mitochondrial function by increasing oxidative stress, disrupting mitochondrial morphology, and impairing critical processes such as fusion, fission, and mitophagy.

Smoking-induced oxidative stress damages the electron transport chain, leading to reduced ATP production and impaired cellular energy metabolism. The resulting energy deficit can hinder cellular processes, contributing to the development of diseases such as cancer and cardiovascular conditions. The study by Tan et al. (2008) concluded that cigarette smoke exposure affects mtDNA in buccal cells of smokers. Additionally, Fetterman, Sammy, and Ballinger (2017) reported that active and passive cigarette smoke decreases mitochondrial respiration and membrane potential, leading to decreased ATP content and increased oxidant production in various tissues, exacerbating cellular dysfunction and contributing to smoking-related diseases.

Pathways like REACTOME_INSULIN_SYNTHESIS_AND_SECRETION and REACTOME_GLUCOS_REGULATION_OF_INSULIN_SECRETION point to potential disruptions in insulin signaling. Smoking has been associated with insulin resistance (Facchini et al. (1992)) and an increased risk of developing type 2 diabetes (Will et al. (2001)) This could be due to inflammation and oxidative stress induced by smoking, affecting pancreatic beta cells’ ability to produce insulin and cells’ sensitivity to insulin.

GO terms such as tryptophan transport, valine transport, and thyroid hormone transport highlight changes in the transport of essential molecules. Smoking can alter the expression of various transporters on cell membranes, impacting nutrient uptake and hormone distribution. Babić Leko et al. (2021) concluded that smoking leads to a decrease in TSH levels and an increase in T3 and T4 levels. The mechanism through which cigarette smoking affects TSH and thyroid hormone levels remains unclear. Changes in amino acid transport can affect protein synthesis and cellular repair mechanisms.

The GO terms daunorubicin metabolic process and doxorubicin metabolic process suggest differences in how former and current smokers metabolize specific substances. These findings align with research showing that smoking induces certain cytochrome P450 enzymes involved in drug metabolism, leading to altered pharmacokinetics and potentially affecting the efficacy and toxicity of various medications (Zhao et al. (2021)).

Collectively, these findings underscore the broad impact of smoking on critical biological functions, affecting everything from protein synthesis to metabolic and transport processes. The impairment of these pathways highlights the profound and multifaceted consequences of smoking on cellular health, contributing to the development of various smoking-related diseases.

6 Conclusions

Differential Expression and Functional Analysis allowed us to identify differentially expressed genes and pathways between current smokers and former smokers, as well as samples with damaged lung tissue (dysplasia) vs. healthy lung tissue (no dysplasia).

We wanted to gain some insight into the effects of smoking on the airway, and whether or not there is significance in smokers vs. people that quit smoking. The fact that there is very significant differential expression between these groups, and that there are pathways that are differentially expressed, tells us that there are significant changes to the genome once the smoking stops. Comparing it to dysplasia status, and seeing so many pathways that are differentially expressed in one classification as well as the other, could indicate there’s a correlation between Histology degree and the smoking habit.

In our analysis of the effects of smoking on cellular functions using Differential Expression (DE), Gene Ontology (GO), and Gene Set Enrichment Analysis (GSEA) on RNA-seq data, we discovered significant molecular alterations between current and former smokers, and individuals with and without dysplasia.

Even though our process did yield significant results, we think it’s still superficial. We could add to this analysis by also sequencing genome from samples that never smoked, or try and go further into the changes the genome goes through once one stops smoking. We think this is a very relevant topic seeing as smoking is still really prevalent, and lung cancer is one of the leading causes of death in the United States.

7 Session information

sessionInfo()
R version 4.4.1 (2024-06-14)
Platform: aarch64-apple-darwin20
Running under: macOS Sonoma 14.3

Matrix products: default
BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

time zone: America/Los_Angeles
tzcode source: internal

attached base packages:
[1] stats4    stats     graphics  grDevices datasets  utils     methods  
[8] base     

other attached packages:
 [1] dplyr_1.1.4                 msigdbr_7.5.1              
 [3] GSVAdata_1.40.0             hgu95a.db_3.13.0           
 [5] GSEABase_1.66.0             fgsea_1.30.0               
 [7] GO.db_3.19.1                org.Hs.eg.db_3.19.1        
 [9] GOstats_2.70.0              graph_1.82.0               
[11] Category_2.70.0             Matrix_1.7-0               
[13] RColorBrewer_1.1-3          ggplot2_3.5.1              
[15] geneplotter_1.82.0          annotate_1.82.0            
[17] XML_3.99-0.17               AnnotationDbi_1.66.0       
[19] lattice_0.22-6              edgeR_4.2.1                
[21] limma_3.60.3                SummarizedExperiment_1.34.0
[23] Biobase_2.64.0              GenomicRanges_1.56.1       
[25] GenomeInfoDb_1.40.1         IRanges_2.38.1             
[27] S4Vectors_0.42.1            BiocGenerics_0.50.0        
[29] MatrixGenerics_1.16.0       matrixStats_1.3.0          
[31] here_1.0.1                  kableExtra_1.4.0           
[33] knitr_1.48                  BiocStyle_2.32.1           

loaded via a namespace (and not attached):
 [1] bitops_1.0-7            DBI_1.2.3               RBGL_1.80.0            
 [4] rlang_1.1.4             magrittr_2.0.3          compiler_4.4.1         
 [7] RSQLite_2.3.7           png_0.1-8               systemfonts_1.1.0      
[10] vctrs_0.6.5             stringr_1.5.1           pkgconfig_2.0.3        
[13] crayon_1.5.3            fastmap_1.2.0           magick_2.8.4           
[16] XVector_0.44.0          labeling_0.4.3          utf8_1.2.4             
[19] rmarkdown_2.27          UCSC.utils_1.0.0        tinytex_0.51           
[22] bit_4.0.5               xfun_0.45               zlibbioc_1.50.0        
[25] cachem_1.1.0            jsonlite_1.8.8          blob_1.2.4             
[28] highr_0.11              DelayedArray_0.30.1     BiocParallel_1.38.0    
[31] parallel_4.4.1          R6_2.5.1                bslib_0.7.0            
[34] stringi_1.8.4           genefilter_1.86.0       jquerylib_0.1.4        
[37] Rcpp_1.0.12             bookdown_0.40           splines_4.4.1          
[40] tidyselect_1.2.1        rstudioapi_0.16.0       abind_1.4-5            
[43] yaml_2.3.9              codetools_0.2-20        tibble_3.2.1           
[46] withr_3.0.0             KEGGREST_1.44.1         evaluate_0.24.0        
[49] AnnotationForge_1.46.0  survival_3.7-0          xml2_1.3.6             
[52] Biostrings_2.72.1       pillar_1.9.0            BiocManager_1.30.23    
[55] KernSmooth_2.23-24      renv_1.0.7              generics_0.1.3         
[58] RCurl_1.98-1.16         rprojroot_2.0.4         munsell_0.5.1          
[61] scales_1.3.0            xtable_1.8-4            glue_1.7.0             
[64] tools_4.4.1             data.table_1.15.4       locfit_1.5-9.10        
[67] babelgene_22.9          fastmatch_1.1-4         cowplot_1.1.3          
[70] grid_4.4.1              colorspace_2.1-0        GenomeInfoDbData_1.2.12
[73] cli_3.6.3               fansi_1.0.6             S4Arrays_1.4.1         
[76] viridisLite_0.4.2       svglite_2.1.3           Rgraphviz_2.48.0       
[79] gtable_0.3.5            sass_0.4.9              digest_0.6.36          
[82] SparseArray_1.4.8       farver_2.1.2            memoise_2.0.1          
[85] htmltools_0.5.8.1       lifecycle_1.0.4         httr_1.4.7             
[88] statmod_1.5.0           bit64_4.0.5            

References

Babić Leko, M., I. Gunjača, N. Pleić, and T. Zemunik. 2021. “Environmental Factors Affecting Thyroid-Stimulating Hormone and Thyroid Hormone Levels.” International Journal of Molecular Sciences 22 (12): 6521. https://doi.org/10.3390/ijms22126521.
Beane, Jennifer, Sarah A. Mazzilli, Anna M. Tassinari, Gang Liu, Xiaohui Zhang, Hanqiao Liu, Anne Dy Buncio, et al. 2017. “Detecting the Presence and Progression of Premalignant Lung Lesions via Airway Gene Expression.” Clinical Cancer Research 23 (17): 5091–5100. https://doi.org/10.1158/1078-0432.ccr-16-2540.
Facchini, F. S., C. B. Hollenbeck, J. Jeppesen, Y. D. Chen, and G. M. Reaven. 1992. “Insulin Resistance and Cigarette Smoking.” Lancet (London, England) 339 (8802): 1128–30. https://doi.org/10.1016/0140-6736(92)90730-q.
Fetterman, J. L., M. J. Sammy, and S. W. Ballinger. 2017. “Mitochondrial Toxicity of Tobacco Smoke and Air Pollution.” Toxicology 391: 18–33. https://doi.org/10.1016/j.tox.2017.08.002.
Kanithi, M., S. Junapudi, S. I. Shah, A. Matta Reddy, G. Ullah, and B. Chidipi. 2022. “Alterations of Mitochondrial Network by Cigarette Smoking and e-Cigarette Vaping.” Cells 11 (1688): 1688. https://doi.org/10.3390/cells11101688.
Maglott, Donna, Jim Ostell, Kim D Pruitt, and Tatiana Tatusova. 2010. “Entrez Gene: Gene-Centered Information at NCBI.” Nucleic Acids Research 39 (suppl_1): D52–57. https://doi.org/10.1093/nar/gkq1237.
Matthew E. Ritchie, Di Wu, Belinda Phipson. 2015. “Limma Powers Differential Expression Analyses for RNA-Sequencing and Microarray Studies.” Nucleic Acids Research 43 (April): e47. https://doi.org/https://doi.org/10.1093/nar/gkv007.
Merzah, M., S. Póliska, L. Balogh, J. Sándor, I. Szász, S. Natae, and S. Fiatal. 2023. “A Transcriptomic Analysis of Smoking-Induced Gene Expression Alterations in Coronary Artery Disease Patients.” International Journal of Molecular Sciences 24 (18): 13920. https://doi.org/10.3390/ijms241813920.
Park, H.-R., J. Vallarino, M. O’Sullivan, C. Wirth, R. A. Panganiban, G. Webb, M. Shumyatcher, et al. 2021. “Electronic Cigarette Smoke Reduces Ribosomal Protein Gene Expression to Impair Protein Synthesis in Primary Human Airway Epithelial Cells.” Scientific Reports 11 (17517): 17517. https://doi.org/10.1038/s41598-021-97013-z.
Smyth, Gordon K. 2004. “Linear Models and Empirical Bayes Methods for Assessing Differential Expression in Microarray Experiments.” Statistical Applications in Genetics and Molecular Biology 3 (1). https://doi.org/https://doi.org/10.2202/1544-6115.1027.
Tan, D., D. S. Goerlitz, R. G. Dumitrescu, D. Han, F. Seillier-Moiseiwitsch, S. M. Spernak, R. A. Orden, J. Chen, R. Goldman, and P. G. Shields. 2008. “Associations Between Cigarette Smoking and Mitochondrial DNA Abnormalities in Buccal Cells.” Carcinogenesis 29 (6): 1170–77. https://doi.org/10.1093/carcin/bgn034.
Will, J. C., D. A. Galuska, E. S. Ford, A. Mokdad, and E. E. Calle. 2001. “Cigarette Smoking and Diabetes Mellitus: Evidence of a Positive Association from a Large Prospective Cohort Study.” International Journal of Epidemiology 30 (3): 540–46. https://doi.org/10.1093/ije/30.3.540.
Zhao, M., J. Ma, M. Li, Y. Zhang, B. Jiang, X. Zhao, C. Huai, et al. 2021. “Cytochrome P450 Enzymes and Drug Metabolism in Humans.” International Journal of Molecular Sciences 22 (23): 12808. https://doi.org/10.3390/ijms222312808.
Ziemann, Mark, Antony Kaspi, and Assam El-Osta. 2019. “Digital Expression Explorer 2: A Repository of Uniformly Processed RNA Sequencing Data.” GigaScience 8 (4): giz022. https://doi.org/10.1093/gigascience/giz022.